Data624_Project1

Part A

Objective

In part A, I will be forecasting the cash withdrawals from 4 different ATM machines for May 2010.

Initial Exploration

A few things that stand out is that the ATM and Date columns are in the wrong format. I also observed that the Cash column contains 19 nulls.

summary(ATM)

##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1474        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40118   Mode  :character   Median :   73.0  
##  Mean   :40118                      Mean   :  155.6  
##  3rd Qu.:40210                      3rd Qu.:  114.0  
##  Max.   :40312                      Max.   :10919.8  
##                                     NA's   :19

glimpse(ATM)

## Rows: 1,474
## Columns: 3
## $ DATE <dbl> 39934, 39934, 39935, 39935, 39936, 39936, 39937, 39937, 39938, 39…
## $ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …

Data Preparation

I start by transforming the Date and ATM columns to their appropriate data type. The Date column is turned to a proper Date format. The ATM column is turned from a character type to a factor type.

To address the nulls, I first wanted to see which ATM(s) had missing values.

ATM <- ATM %>%
  mutate(DATE = as.Date(DATE, origin = "1899-12-30")) %>%
  mutate(ATM = as.factor(ATM)) %>%
  filter(!is.na(ATM))

Upon further inspection, we can see that there are a few dates where there were Null cash withdrawls for ATM 1 and 2. Because of how few missing values there are, I’m going to assume that no withdrawls took place. I will replace the Nulls with zeros.

ATM %>%
  filter(is.na(Cash)) %>%
  select(ATM, DATE, Cash)

## # A tibble: 5 × 3
##   ATM   DATE        Cash
##   <fct> <date>     <dbl>
## 1 ATM1  2009-06-13    NA
## 2 ATM1  2009-06-16    NA
## 3 ATM2  2009-06-18    NA
## 4 ATM1  2009-06-22    NA
## 5 ATM2  2009-06-24    NA

ATM <- ATM %>%
  mutate(Cash = replace_na(Cash, 0))

Next I transform the ATM dataframe into a tsibble dataset so that I can perform time series analysis.

ATM_ts <- ATM %>%
  as_tsibble(index = DATE, key = ATM)

Further Exploration

Now that the data is prepared, we can look at the withdrawal patterns for each ATM Machine.

ATM 1 doesn’t display any clear trend however, we can see there are seasonal patterns. There are some peaks and falls that stick out.

ATM_ts %>%
  filter(ATM == "ATM1") %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Cash`

In ATM 2, we can also see an obscence of a clear trend and seasonal patterns. The peaks and falls are very consistent and fall within the same range throughout the plot.

ATM_ts %>%
  filter(ATM == "ATM2") %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Cash`

In ATM 3, it’s difficult to identify any patterns because there are very little withdrawls.

ATM_ts %>%
  filter(ATM == "ATM3") %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Cash`

ATM 4 has a very stable plot with the exception of an outlier high around 2010. There is no clear trend, and we can observe seasonality.

ATM_ts %>%
  filter(ATM == "ATM4") %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Cash`

Model Building

To create our models, I’m going to separate the data for all 4 ATMs into their own tsibble object. Then I will examine the ACF and PACF plots to determine which forecast model to utilize.

ATM1 <- ATM_ts %>%
  filter(ATM == 'ATM1')

ATM2 <- ATM_ts %>%
  filter(ATM == 'ATM2')

ATM3 <- ATM_ts %>%
  filter(ATM == 'ATM3')

ATM4 <- ATM_ts %>%
  filter(ATM == 'ATM4')

ATM 1

When looking at the ACF and PACF plots for ATM1 we can observe seasonality. Given this observation, a SARIMA model can accurately capture these type of patterns.

Before creating our models, we will first perform differencing and attempt to make the data stationary - an important step before creating our SARIMA model.

For the SARIMA models, I will create two different models. One by using the ARIMA function which will automatically determine the best parameters and a second model where I will manually select the parameters. Then I will compare metrics and pick the best forecasting model for each ATM dataset.

ATM_ts %>%
  filter(ATM == 'ATM1') %>%
  gg_tsdisplay(Cash, plot_type = 'partial', lag = 180)

# perform differencing
ATM1 <- ATM1 %>%
  mutate(Cash_diff = difference(Cash, lag = 7))

We can see that the data is not fully stationary. The ARIMA function will capture the patterns that remain.

# double check to see if data is stationary
ATM1 %>%
  gg_tsdisplay(Cash_diff, plot_type = 'partial', lag = 180)

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

The custom Arima1 with the parameters I set is the better model. It has lower ME, RMSE, and MAE. I will select this as our forecasting model.

model1 <- ATM1 %>%
  model(
    arima = ARIMA(Cash),
    arima1 = ARIMA(Cash ~ pdq(1,0,1) + PDQ(1,1,1, period = 7))
)

accuracy(model1)

## # A tibble: 2 × 11
##   ATM   .model .type         ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <fct> <chr>  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ATM1  arima  Training -0.103   25.9  16.3  -Inf   Inf 0.847 0.831  0.0650
## 2 ATM1  arima1 Training -0.0600  25.7  16.1  -Inf   Inf 0.837 0.824 -0.0153

forecasts1 <- model1 %>%
  forecast(h = 31) 

autoplot(forecasts1, ATM1)

ATM 2

We will take the same approach for ATM1. We’ll try to address the seasonality through differencing and then create two SARIMA models: One created through the ARIMA function and another with parameters that I set.

# acf and pacf prior to differencing
ATM_ts %>%
  filter(ATM == 'ATM2') %>%
  gg_tsdisplay(Cash, plot_type = 'partial', lag = 180)

ATM2 <- ATM2 %>%
  mutate(Cash_diff = difference(Cash, lag = 7))

# acf and pacf after differencing
ATM2 %>%
  gg_tsdisplay(Cash_diff, plot_type = 'partial', lag = 180)

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

The model named arima has a lower RMSE, MAE and RMSE. I will choose this as our forecasting model for ATM2.

model2 <- ATM2 %>%
  model(
    arima = ARIMA(Cash),
    arima1 = ARIMA(Cash ~ pdq(1,0,1) + PDQ(1,1,1, period = 7))
)

accuracy(model2)

## # A tibble: 2 × 11
##   ATM   .model .type        ME  RMSE   MAE   MPE  MAPE  MASE RMSSE      ACF1
##   <fct> <chr>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
## 1 ATM2  arima  Training -0.854  23.9  16.8  -Inf   Inf 0.822 0.802  0.000718
## 2 ATM2  arima1 Training -0.349  24.6  17.1  -Inf   Inf 0.835 0.828 -0.0247

forecasts2 <- model2 %>%
  forecast(h = 31) 

autoplot(forecasts2, ATM2)

ATM 3

Given the lack of data for ATM3, and lack of clear seasonality or trend, I will utilize a Naive model. The Naive model assumes that the most recent observation is the best predictor of the following value.

ATM_ts %>%
  filter(ATM == 'ATM3') %>%
  gg_tsdisplay(Cash, plot_type = 'partial', lag = 180)

model3 <- ATM3 %>%
  model(Naive = NAIVE(Cash))

accuracy(model3)

## # A tibble: 1 × 11
##   ATM   .model .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <fct> <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ATM3  Naive  Training 0.234  5.09 0.310  28.8  40.2 0.423 0.632 -0.149

forecasts3 <- model3 %>%
  forecast(h = 31) 

autoplot(forecasts3, ATM3)

ATM 4

The Arima1 model outperforms the Arima model in the RMSE, MAPE and MASE metrics. Given that it has less bias and is less error prone, I will select this as our forecasting model.

ATM_ts %>%
  filter(ATM == 'ATM4') %>%
  gg_tsdisplay(Cash, plot_type = 'partial', lag = 180)

model4 <- ATM4 %>%
  model(
    arima = ARIMA(Cash),
    arima1 = ARIMA(Cash ~ pdq(1,0,1) + PDQ(1,1,1, period = 7)+ 0)
)

## Warning in sqrt(diag(best$var.coef)): NaNs produced

accuracy(model4)

## # A tibble: 2 × 11
##   ATM   .model .type           ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <fct> <chr>  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ATM4  arima  Training -1.51e-10  650.  324. -617.  647. 0.805 0.725 -0.00936
## 2 ATM4  arima1 Training  1.59e+ 1  634.  288. -542.  576. 0.716 0.708 -0.00774

forecasts4 <- model4 %>%
  forecast(h = 31) 

autoplot(forecasts4, ATM4)

Lastly, I will combine the forecasted values into one dataframe to export.

combined_forecasts <- bind_rows(
  forecasts1 %>% as_tibble() %>% select(DATE, ATM, .mean),
  forecasts2 %>% as_tibble() %>% select(DATE, ATM, .mean),
  forecasts3 %>% as_tibble() %>% select(DATE, ATM, .mean),
  forecasts4 %>% as_tibble() %>% select(DATE, ATM, .mean)
)

Part B

Objective

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. I will model this data and give a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours.

Initial Exploration

At first glance, I can see that the date column labeled ‘YYYY-MMM’is a charactrer type and needs to be changed to a date type. I also noticed that on one of the days there is a KWH reading of NA. We will need to look at the missing value and transformed the ’YYYY-MMM’ column.

summary(Power)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

glimpse(Power)

## Rows: 192
## Columns: 3
## $ CaseSequence <dbl> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
## $ `YYYY-MMM`   <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
## $ KWH          <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…

There is a missing recording of KWH for September 2008. Since it’s only one missing value and given the large data we have going back to 1998, we can use the averages for the month of September as a placeholder for September 2008.

Power %>%
  filter(is.na(KWH))

## # A tibble: 1 × 3
##   CaseSequence `YYYY-MMM`   KWH
##          <dbl> <chr>      <dbl>
## 1          861 2008-Sep      NA

Data Preparation

In this section, we will convert the ‘YYYY-MMM’ column into a date column and turn ‘Power’ into a tsibble object. I will also get the average KWH value for September and replace the missing value from September 2008 with this average.

Power <- Power %>%
  mutate(Date = yearmonth(`YYYY-MMM`)) %>%
  select(-`YYYY-MMM`) %>%
  as_tsibble()

## Using `Date` as index variable.

# look at what values are like historically for Sept
Sept <- Power %>%
  filter(month(Date) == 9 & !is.na(KWH)) 

autoplot(Sept, KWH)

september_avg <- Power %>%
  filter(month(Date) == 9 & !is.na(KWH)) %>%
  summarize(mean_kwh = mean(KWH, na.rm = TRUE)) %>%
  pull(mean_kwh)


Power$KWH[Power$Date == yearmonth("2008-Sep") & is.na(Power$KWH)] <- september_avg

## Warning in Power$KWH[Power$Date == yearmonth("2008-Sep") & is.na(Power$KWH)] <-
## september_avg: number of items to replace is not a multiple of replacement
## length

Model Building

When looking at the overall timeseries data for Power, we can see that there is no visible trend. We can see a consistent seasonal pattern. There is also a noticeable dip near January 2010.

I will address the seasonality by performing differencing. I will also utilize a SARIMA model to properly capture the seasonal pattern. Similarly like in Part A, I will use a baseline Sarima model using the ARIMA function and a second model that I will set the parameters.

autoplot(Power, KWH)

Power %>%
 gg_tsdisplay(KWH, plot_type = 'partial', lag = 24)

Power <- Power %>%
  mutate(KWH_diff = difference(KWH, lag = 12))

Power %>%
 gg_tsdisplay(KWH_diff, plot_type = 'partial', lag = 24)

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

The baseline arima model performs better across metrics like RMSE, MAE and MAPE.

model5 <- Power %>%
  model(
    arima = ARIMA(KWH_diff),
    arima1 = ARIMA(KWH_diff ~ pdq(1,0,1) + PDQ(1,0,1, period = 12) + 0)
)

accuracy(model5)

## # A tibble: 2 × 10
##   .model .type         ME    RMSE     MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima  Training -11277. 871906. 539503. -2459. 2672. 0.439 0.418 -0.00475
## 2 arima1 Training 139963. 874915. 549053. -2657. 2983. 0.447 0.420  0.00403

forecasts5 <- model5 %>%
  forecast(h = 12)

autoplot(forecasts5, Power)

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

Data624_Project1

Christian Uriostegui

2024-10-26

Part A

Objective

Initial Exploration

Data Preparation

Further Exploration

Model Building

ATM 1

ATM 2

ATM 3

ATM 4

Part B

Objective

Initial Exploration

Data Preparation

Model Building