Part A - ATM

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.

Data Preparation and Exploration

Loading the data from the provided Excel file.

atm <- read_excel('ATM624Data.xlsx')

head(atm)
## # A tibble: 6 x 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

Looking at the ATM dataset, we see that the date carried over from the Excel workbook is a numeric serial number rather than a date. Let’s convert the DATE variable now; the data frame will be converted into a tsibble object for time series analysis after imputation.

atm <- atm %>%
  mutate(DATE = as.Date(DATE, origin = '1899-12-30'),  # Excel serial date origin
         ATM = as.factor(ATM))

Taking a look at the missing values.

gg_miss_upset(atm)
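
Since the upset plot renders as a figure, a quick numeric tally (a minimal check; counts not reproduced here) shows how many values are missing per column:

# count missing values per column (sketch; output omitted)
colSums(is.na(atm))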

We will separate the data into data frames by ATM type, so that each series is imputed using only data from its own ATM.

# separating by ATM type
atm1 <- atm %>%
  filter(ATM == 'ATM1')

atm2 <- atm %>%
  filter(ATM == 'ATM2')

atm3 <- atm %>% 
  filter(ATM == 'ATM3')

atm4 <- atm %>%
  filter(ATM == 'ATM4')

# function to impute missing values within a single ATM's data
impute_df <- function(x){
  
  # dry run (maxit = 0) just to extract the default predictor matrix
  init <- mice(x, maxit = 0)
  predM <- init$predictorMatrix
  set.seed(123)
  # impute with random forests, producing m = 5 completed datasets
  imputed <- mice(x, method = 'rf', 
                  predictorMatrix = predM, m = 5)
  
  # return the first completed dataset
  x <- complete(imputed)
  
  return(x)
}

# impute each dataset with the function above
atm1 <- impute_df(atm1)
## Warning: Number of logged events: 1
## (mice iteration log trimmed: 5 iterations x 5 imputations of Cash)
atm2 <- impute_df(atm2)
## Warning: Number of logged events: 1
## (mice iteration log trimmed: 5 iterations x 5 imputations of Cash)
atm3 <- impute_df(atm3)
## Warning: Number of logged events: 1
## (mice iteration log trimmed: no incomplete variables to impute)
atm4 <- impute_df(atm4)
## Warning: Number of logged events: 1
## (mice iteration log trimmed: no incomplete variables to impute)

After the imputation, we will transform each data frame into a tsibble object. Furthermore, we will fill any missing dates in the tsibble with the median Cash value, which should have minimal impact on the distribution of the series.

ts_object <- function(x){
  x <- x %>% 
    as_tsibble(key = ATM,
               index = DATE) %>%
    group_by_key() %>%
    # fill implicit date gaps with the median Cash value
    fill_gaps(Cash = median(Cash))
  
  return(x)
}

atm1 <- ts_object(atm1)
atm2 <- ts_object(atm2)
atm3 <- ts_object(atm3)
atm4 <- ts_object(atm4)

Once the data is in an appropriate form for time series analysis, we can take a first look at each series.

atm1 %>%
  autoplot() +
  labs(title = 'ATM1')
## Plot variable not specified, automatically selected `.vars = Cash`

atm2 %>%
  autoplot() +
  labs(title = 'ATM2')
## Plot variable not specified, automatically selected `.vars = Cash`

atm3 %>%
  autoplot() +
  labs(title = 'ATM3')
## Plot variable not specified, automatically selected `.vars = Cash`

atm4 %>%
  autoplot() +
  labs(title = 'ATM4')
## Plot variable not specified, automatically selected `.vars = Cash`

After looking at the data, we want to stabilize its variance before creating a model. Here, we perform the Box-Cox transformation on all datasets other than ATM3, as most of its values are 0 and the transformation would produce an invalid result.
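
For intuition (a minimal check, using the same box_cox() the report relies on): with lambda = 0 the transform reduces to log(y), so zero values map to -Inf, and negative lambdas run into the same problem on a series dominated by zeros.

# minimal check: Box-Cox at lambda = 0 is log(y), which is undefined at 0
box_cox(0, 0)   # -Inf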

cash_boxcox <- function(x){
  lambda <- x %>%
    features(Cash, features = guerrero) %>%
    pull(lambda_guerrero)
  
  # print lambda; we will need these values later to invert the Box-Cox transformation
  print(lambda)
  x <- x %>%
    mutate(Cash = box_cox(Cash, lambda))

  return(x)
}

atm1 <- cash_boxcox(atm1)
## [1] 0.3673124
atm2 <- cash_boxcox(atm2)
## [1] 0.6958635
atm4 <- cash_boxcox(atm4)
## [1] -0.1228524
lambdas <- c(0.3673124,0.6958635,-0.1228524)
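
Hand-copying the printed lambdas invites transcription errors. A variant of the helper (a sketch, not run here) could return both the transformed data and its lambda so nothing is re-typed from the console:

# sketch (not run): return the transformed data together with its lambda
cash_boxcox2 <- function(x){
  lambda <- x %>%
    features(Cash, features = guerrero) %>%
    pull(lambda_guerrero)
  list(data = mutate(x, Cash = box_cox(Cash, lambda)),
       lambda = lambda)
}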

ATM1

Looking at ATM1 more closely, the ACF spikes at lags 7, 14, and 21 indicate a weekly seasonal pattern.

atm1 %>%
  gg_tsdisplay(Cash, plot_type = 'partial')
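
To make the weekly pattern explicit, an STL decomposition (a sketch; the resulting plot is not shown here) can separate the seasonal component from the trend and remainder:

# sketch: STL decomposition with a weekly seasonal period (plot not shown)
atm1 %>%
  model(STL(Cash ~ season(period = 7))) %>%
  components() %>%
  autoplot()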

Here we will create models using the Exponential Smoothing (ETS), ARIMA, and Seasonal Naive methods to forecast a one-month period. We will compare the results on the held-out data and choose the best performing model to predict the Cash variable for ATM1 for May 2010.

atm1_train <- atm1 %>%
  filter_index('2009-05-01' ~ '2010-03-31') 

atm1_fit <- atm1_train %>%
  model(
    ETS = ETS(Cash),
    ARIMA = ARIMA(Cash),
    SNAIVE = SNAIVE(Cash)
  ) 

atm1_fc <- atm1_fit %>% forecast(h = 30)

atm1_fc %>%
  autoplot(atm1_train, level = NULL) + 
  autolayer(
    filter_index(atm1, '2010-04-01' ~ .),
    color = 'black'
  ) +
  labs(
    title = 'Forecasts for Daily Cash Production by ATM1'
  ) +
  guides(color = guide_legend(title = 'Forecast'))
## `mutate_if()` ignored the following grouping variables:
## Column `ATM`
## Plot variable not specified, automatically selected `.vars = Cash`

Taking a look at the models’ metrics to see their performance on the held-out data.

accuracy(atm1_fc, atm1)
## # A tibble: 3 x 11
##   .model ATM   .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <fct> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA  ATM1  Test  -0.436 0.962 0.676 -19.8  21.9 0.508 0.402 -0.307 
## 2 ETS    ATM1  Test  -0.456 0.966 0.675 -20.2  22.1 0.508 0.403 -0.338 
## 3 SNAIVE ATM1  Test  -0.746 1.59  1.14  -34.6  38.0 0.859 0.665 -0.0339

The generated ARIMA model performs the best overall, with the lowest (or essentially tied-lowest) values across the metrics.

We will use the Box-Pierce test (a close relative of the Ljung-Box test) on the residuals from each model to determine which models have residuals that are indistinguishable from white noise.

atm1_fit %>%
  select(ARIMA) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 42, dof = 1)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     30.4     0.887
atm1_fit %>%
  select(ETS) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 42, dof = 1)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       38.7     0.575
atm1_fit %>%
  select(SNAIVE) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 42, dof = 1)
## # A tibble: 1 x 3
##   .model bp_stat  bp_pvalue
##   <chr>    <dbl>      <dbl>
## 1 SNAIVE    97.7 0.00000156
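
For reference, the Ljung-Box variant of the same portmanteau test is available through the ljung_box feature (a sketch; output not shown) and typically leads to the same conclusion:

# sketch: Ljung-Box variant of the portmanteau test (output not shown)
atm1_fit %>%
  select(ARIMA) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 42, dof = 1)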

The ARIMA model performed the best and passed the residual test. Therefore, it is a usable model: its innovation residuals are uncorrelated and have approximately zero mean.

atm1_fit %>%
  select(ARIMA) %>%
  report()
## Series: Cash 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1146  -0.1059  -0.6543
## s.e.  0.0549   0.0551   0.0474
## 
## sigma^2 estimated as 4.109:  log likelihood=-697.65
## AIC=1403.3   AICc=1403.42   BIC=1418.47
atm1_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals()

Using the ARIMA model, refit on the full ATM1 series, to forecast Cash for May 2010. Next, we will invert the Box-Cox transformation using the stored lambda values.

atm1_fit <- atm1 %>%
  model(ARIMA(Cash))

atm1_fc <- atm1_fit %>% forecast(h = 31)

atm1_fc %>%
  autoplot() + 
  autolayer(atm1)
## Plot variable not specified, automatically selected `.vars = Cash`

atm1_fc_may2010 <- atm1_fc %>%
  as.data.frame() %>%
  select(DATE, ATM, .mean)

atm1_fc_may2010$Cash <- inv_box_cox(atm1_fc_may2010$.mean, lambdas[1])

atm1_fc_may2010 <- atm1_fc_may2010 %>% select(-.mean)
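
As a quick sanity check (a sketch; value not reproduced here), applying the inverse to a transformed value should recover the original scale:

# sketch: inv_box_cox() undoes box_cox() for the same lambda
inv_box_cox(box_cox(96, lambdas[1]), lambdas[1])   # returns 96 (up to floating point)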

ATM2

Looking at ATM2 more closely, large ACF spikes at lags 7, 14, and 21, with smaller spikes in between, indicate a seasonal pattern.

atm2 %>%
  gg_tsdisplay(Cash, plot_type = 'partial')

Here we will create models using the Exponential Smoothing (ETS), ARIMA, and Seasonal Naive methods to forecast a one-month period. We will compare the results on the held-out data and choose the best performing model to predict the Cash variable for ATM2 for May 2010.

atm2_train <- atm2 %>%
  filter_index('2009-05-01' ~ '2010-03-30') 

atm2_fit <- atm2_train %>%
  model(
    ETS = ETS(Cash),
    ARIMA = ARIMA(Cash),
    SNAIVE = SNAIVE(Cash)
  ) 

atm2_fc <- atm2_fit %>% forecast(h = 30)

atm2_fc %>%
  autoplot(atm2_train, level = NULL) + 
  autolayer(
    filter_index(atm2, '2010-04-01' ~ .),
    color = 'black'
  ) +
  labs(
    title = 'Forecasts for Daily Cash Production by ATM2'
  ) +
  guides(color = guide_legend(title = 'Forecast'))
## `mutate_if()` ignored the following grouping variables:
## Column `ATM`
## Plot variable not specified, automatically selected `.vars = Cash`

Taking a look at the models’ metrics to see their performance on the held-out data.

accuracy(atm2_fc, atm2)
## # A tibble: 3 x 11
##   .model ATM   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA  ATM2  Test   2.57  5.78  4.51  -Inf   Inf 0.700 0.605  0.0358
## 2 ETS    ATM2  Test   3.01  5.44  4.11  -Inf   Inf 0.638 0.570 -0.189 
## 3 SNAIVE ATM2  Test   4.92  8.42  6.23   Inf   Inf 0.967 0.882 -0.440

The generated ETS model performs the best on all metrics other than the Mean Error (ME), having the lowest values for each of the remaining metrics.

Using the Box-Pierce test on the residuals from each model, we can determine which models have residuals that are indistinguishable from white noise.

atm2_fit %>%
  select(ARIMA) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 42, dof = 1)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     38.9     0.565
atm2_fit %>%
  select(ETS) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 42, dof = 1)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       62.9    0.0154
atm2_fit %>%
  select(SNAIVE) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 42, dof = 1)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 SNAIVE    134.  9.95e-12

Although the ETS model performed the best on accuracy, it did not pass the residual test. Instead, we will use the ARIMA model, which is usable: its innovation residuals are uncorrelated and have approximately zero mean.

atm2_fit %>%
  select(ARIMA) %>%
  report()
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4209  -0.9220  0.4598  0.8095  -0.7540
## s.e.   0.0537   0.0445  0.0860  0.0593   0.0461
## 
## sigma^2 estimated as 61.72:  log likelihood=-1137.92
## AIC=2287.84   AICc=2288.1   BIC=2310.58
atm2_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals()

Using the ARIMA model, refit on the full ATM2 series, to forecast Cash for May 2010. Next, we will invert the Box-Cox transformation using the stored lambda values.

atm2_fit <- atm2 %>%
  model(ARIMA(Cash))

atm2_fc <- atm2_fit %>% forecast(h = 31)

atm2_fc %>%
  autoplot() + 
  autolayer(atm2)
## Plot variable not specified, automatically selected `.vars = Cash`

atm2_fc_may2010 <- atm2_fc %>%
  as.data.frame() %>%
  select(DATE, ATM, .mean)

atm2_fc_may2010$Cash <- inv_box_cox(atm2_fc_may2010$.mean, lambdas[2])

atm2_fc_may2010 <- atm2_fc_may2010 %>% select(-.mean)

ATM3

Looking at ATM3 more closely, we notice that there are only 3 non-zero entries for the Cash variable in the time series.

atm3 %>%
  gg_tsdisplay(Cash, plot_type = 'partial')

The first date with a non-zero value is 2010-04-28, as confirmed below. We will filter the dataset to begin at this date.

temp <- atm3 %>%
  filter(Cash != 0)

min(temp$DATE)
## [1] "2010-04-28"

Taking a second look at the abbreviated dataset.

atm3 <- atm3 %>%
  filter_index('2010-04-28' ~ .)
  

atm3 %>%
  gg_tsdisplay(Cash, plot_type = 'partial')

Since the usable time series is so short, we will create models using the simple Mean, Naive, and Drift methods. We will compare the results on a one-day holdout and choose the best performing model to predict the Cash variable for ATM3 for May 2010.

atm3_train <- atm3 %>%
  filter_index('2010-04-28' ~ '2010-04-29') 

atm3_fit <- atm3_train %>%
  model(
    Mean = MEAN(Cash),
    Naive = NAIVE(Cash),
    Drift = NAIVE(Cash ~ drift())
  ) 

atm3_fc <- atm3_fit %>% forecast(h = 1)

atm3_fc %>%
  autoplot(atm3_train, level = NULL) + 
  autolayer(
    atm3,
    color = 'black'
  ) +
  labs(
    y = 'Cash (in hundreds)',
    title = 'Forecasts for Daily Cash Production by ATM3'
  ) +
  guides(color = guide_legend(title = 'Forecast'))
## `mutate_if()` ignored the following grouping variables:
## Column `ATM`
## Plot variable not specified, automatically selected `.vars = Cash`

Taking a look at the models’ metrics to see their performance on the held-out data.

accuracy(atm3_fc, atm3)
## Warning: 1 error encountered
## [1] subscript out of bounds

## Warning: 1 error encountered
## [1] subscript out of bounds

## Warning: 1 error encountered
## [1] subscript out of bounds
## # A tibble: 3 x 11
##   .model ATM   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift  ATM3  Test     17    17    17 20    20      NaN   NaN    NA
## 2 Mean   ATM3  Test     -4     4     4 -4.71  4.71   NaN   NaN    NA
## 3 Naive  ATM3  Test      3     3     3  3.53  3.53   NaN   NaN    NA

The generated Naive model performs the best across all metrics. Examining the residuals does not seem appropriate here, since there is only one predicted value.

Using the Naive model, refit on the full (abbreviated) series, to forecast Cash for May 2010. No Box-Cox inversion is needed, since ATM3 was never transformed.

atm3_fit <- atm3 %>%
  model(NAIVE(Cash))

atm3_fc <- atm3_fit %>% forecast(h = 31)

atm3_fc %>%
  autoplot() + 
  autolayer(atm3)
## Plot variable not specified, automatically selected `.vars = Cash`

atm3_fc_may2010 <- atm3_fc %>%
  as.data.frame() %>%
  select(DATE, ATM, .mean)

atm3_fc_may2010$Cash <- atm3_fc_may2010$.mean

atm3_fc_may2010 <- atm3_fc_may2010 %>% select(-.mean)

ATM4

Looking at ATM4 more closely, ACF spikes at lags 7, 14, and 21 indicate a weekly seasonal pattern.

atm4 %>%
  gg_tsdisplay(Cash, plot_type = 'partial')

Here we will create models using the Exponential Smoothing (ETS), ARIMA, and Seasonal Naive methods to forecast a one-month period. We will compare the results on the held-out data and choose the best performing model to predict the Cash variable for ATM4 for May 2010.

atm4_train <- atm4 %>%
  filter_index('2009-05-01' ~ '2010-03-31') 

atm4_fit <- atm4_train %>%
  model(
    ETS = ETS(Cash),
    ARIMA = ARIMA(Cash),
    SNAIVE = SNAIVE(Cash)
  ) 

atm4_fc <- atm4_fit %>% forecast(h = 30)

atm4_fc %>%
  autoplot(atm4_train, level = NULL) + 
  autolayer(
    filter_index(atm4, '2010-04-01' ~ .),
    color = 'black'
  ) +
  labs(
    title = 'Forecasts for Daily Cash Production by ATM4'
  ) +
  guides(color = guide_legend(title = 'Forecast'))
## `mutate_if()` ignored the following grouping variables:
## Column `ATM`
## Plot variable not specified, automatically selected `.vars = Cash`

Taking a look at the models’ metrics to see their performance on the held-out data.

accuracy(atm4_fc, atm4)
## # A tibble: 3 x 11
##   .model ATM   .type      ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>  <fct> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ARIMA  ATM4  Test  -0.137  0.832 0.580 -16.8  26.8 0.897 0.898 -0.158  
## 2 ETS    ATM4  Test  -0.0794 0.975 0.686 -17.5  31.3 1.06  1.05  -0.00348
## 3 SNAIVE ATM4  Test  -0.335  0.725 0.425 -18.0  20.3 0.659 0.782 -0.0120

The generated SNAIVE model performs the best on most metrics, having the lowest RMSE, MAE, MAPE, MASE, and RMSSE.

Using the Box-Pierce test on the residuals from each model, we can determine which models have residuals that are indistinguishable from white noise.

atm4_fit %>%
  select(ARIMA) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 42, dof = 1)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     57.0    0.0492
atm4_fit %>%
  select(ETS) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 42, dof = 1)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       51.9     0.117
atm4_fit %>%
  select(SNAIVE) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 42, dof = 1)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 SNAIVE    143.  3.29e-13

Although the SNAIVE model performed the best on accuracy, the ETS model is the only one that passed the residual test. We will use the ETS model, as its innovation residuals are uncorrelated and have approximately zero mean.

atm4_fit %>%
  select(ETS) %>%
  gg_tsresiduals()

Using the ETS model, refit on the full ATM4 series, to forecast Cash for May 2010. Next, we will invert the Box-Cox transformation using the stored lambda values.

atm4_fit <- atm4 %>%
  model(ETS(Cash))

atm4_fc <- atm4_fit %>% forecast(h = 31)

atm4_fc %>%
  autoplot() + 
  autolayer(atm4)
## Plot variable not specified, automatically selected `.vars = Cash`

atm4_fc_may2010 <- atm4_fc %>%
  as.data.frame() %>%
  select(DATE, ATM, .mean)

atm4_fc_may2010$Cash <- inv_box_cox(atm4_fc_may2010$.mean, lambdas[3])

atm4_fc_may2010 <- atm4_fc_may2010 %>% select(-.mean)

Combining all forecast datasets by ATM.

atm_may2010 <- rbind(atm1_fc_may2010, atm2_fc_may2010, atm3_fc_may2010, atm4_fc_may2010)

write_xlsx(atm_may2010, 'ATM May 2010 Forecast.xlsx')

Part B - Forecasting Power, Residential Customer Forecast

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Data Preparation and Exploration

Loading the data from the provided Excel file.

power <- read_excel('ResidentialCustomerForecastLoad-624.xlsx')

Taking a look at the missing data, there is only one missing value, in the KWH column. We will use the previously created function to impute it. I chose this approach because there is seasonality in the data that should be taken into account.

# this shows that there is only one missing value
summary(power)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
power <- impute_df(power)
## Warning: Number of logged events: 1
## (mice iteration log trimmed: 5 iterations x 5 imputations of KWH)

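An alternative worth noting (a sketch assuming the imputeTS package; not run here) is a seasonally aware interpolation, which handles a single gap in seasonal data well:

# sketch (not run): seasonal-decomposition imputation via imputeTS.
# na_seadec() removes the seasonal component, interpolates the gap,
# then adds the seasonal component back.
# library(imputeTS)
# power$KWH <- as.numeric(na_seadec(ts(power$KWH, frequency = 12)))
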
Loading time series data from Excel creates an issue with the date variable. Before converting the data frame into a tsibble object, we need to parse the ‘YYYY-MMM’ strings into a proper date format.

power <- power %>%
   mutate(Month = yearmonth(as_date(paste(`YYYY-MMM`, '01', sep='-'), format='%Y-%b-%d'))) %>%
  select(Month, KWH) %>%
  as_tsibble(index = Month)

With that, we can take a look at the time series data.

power %>%
  gg_tsdisplay()
## Plot variable not specified, automatically selected `y = KWH`

The graph shows seasonality and a slight upward trend in the time series data.
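
A seasonal plot (a sketch; figure not shown) overlays each year and makes the within-year pattern easier to inspect:

# sketch: overlay each year to inspect the monthly seasonal pattern
power %>%
  gg_season(KWH)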

Before creating a forecasting model, we will perform a Box-Cox transformation on the series to stabilize its variance.

lambda <- power %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

power <- power %>%
  mutate(KWH = box_cox(KWH, lambda))

Here we will create models using the Exponential Smoothing (ETS), ARIMA, and Seasonal Naive methods to forecast a one-year period. We will compare the results on the held-out data and choose the best performing model to predict KWH for 2014.

power_train <- power %>%
  filter_index(. ~ '2012-12-31') 

power_fit <- power_train %>%
  model(
    ETS = ETS(KWH),
    ARIMA = ARIMA(KWH),
    SNAIVE = SNAIVE(KWH)
  ) 

power_fc <- power_fit %>% forecast(h = 12)

power_fc %>%
  autoplot(power_train, level = NULL) + 
  autolayer(
    filter_index(power, '2013-01-01' ~ .),
    color = 'black'
  ) +
  labs(
    title = 'Forecasts for Monthly Residential Power Usage'
  ) +
  guides(color = guide_legend(title = 'Forecast'))
## Plot variable not specified, automatically selected `.vars = KWH`

Taking a look at the models’ metrics to see their performance on the held-out data.

accuracy(power_fc, power)
## # A tibble: 3 x 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA  Test  0.633 1.33  0.979 1.20   1.90 1.15  0.693 -0.0734
## 2 ETS    Test  0.744 1.14  0.852 1.45   1.67 1.00  0.594  0.164 
## 3 SNAIVE Test  0.388 0.934 0.571 0.751  1.11 0.671 0.487 -0.0318

The generated SNAIVE model performs the best when looking at all metrics.

We will use the Box-Pierce test on the model residuals to determine which models have residuals that are indistinguishable from white noise.

power_fit %>%
  select(ARIMA) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 24, dof = 0)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     41.3    0.0153
power_fit %>%
  select(ETS) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 24, dof = 0)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       12.6     0.972

The Box-Pierce test indicates that the ETS model’s residuals are indistinguishable from white noise. Therefore, it is a usable model: its innovation residuals are uncorrelated and have approximately zero mean. Even though this model did not perform best in the accuracy comparison, we will use it based on the residual test.

power_fit %>%
  select(ETS) %>%
  report()
## Series: KWH 
## Model: ETS(M,N,A) 
##   Smoothing parameters:
##     alpha = 0.03655036 
##     gamma = 0.0001000306 
## 
##   Initial states:
##      l[0]       s[0]    s[-1]      s[-2]    s[-3]    s[-4]     s[-5]     s[-6]
##  48.81389 -0.3195776 -1.78057 -0.7923634 1.501721 1.983819 0.8178475 0.2048149
##     s[-7]     s[-8]      s[-9]    s[-10]   s[-11]
##  -1.82363 -1.310447 -0.5583567 0.4978488 1.578893
## 
##   sigma^2:  7e-04
## 
##      AIC     AICc      BIC 
## 1050.999 1053.925 1098.893
power_fit %>%
  select(ETS) %>%
  gg_tsresiduals()

Using the ETS model, refit on the full series, to forecast KWH for 2014. Next, we will invert the Box-Cox transformation using the stored lambda value.

power_fit <- power %>%
  model(ETS(KWH))

power_fc <- power_fit %>% forecast(h = 12)

power_fc %>%
  autoplot() + 
  autolayer(power)
## Plot variable not specified, automatically selected `.vars = KWH`

power_fc_2014 <- power_fc %>%
  as.data.frame() %>%
  select(Month, KWH, .mean)

power_fc_2014$KWH <- inv_box_cox(power_fc_2014$.mean, lambda)

power_fc_2014 <- power_fc_2014 %>% select(-.mean) %>%
  mutate(Month = as.POSIXct(Month))

Having converted the Month index to POSIXct so it can be written, we will now save the forecasted values into an Excel file.

write_xlsx(power_fc_2014, 'Residential Customer Forecast 2014.xlsx')

Part C - Waterflow Pipe

Part C consists of two datasets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (for multiple recordings within an hour, take the mean), then determine whether the data is stationary and whether it can be forecast. If so, provide a week-ahead forecast and present the results via RPubs and the .rmd file, along with the forecast in an Excel-readable file.

Data Preparation and Exploration

Loading the datasets from the provided Excel files.

water1 <- read_excel('Waterflow_Pipe1.xlsx')
water2 <- read_excel('Waterflow_Pipe2.xlsx')

We need to convert the date variable carried over from the Excel files, which is stored as a fractional-day serial number. For Pipe 1, which has multiple readings per hour, we also floor the timestamps to the hour so the two series can be combined.

water1 <- water1 %>%
    mutate(Datetime = as.POSIXct(`Date Time` * (60*60*24), origin = '1899-12-30') %>% round_date(unit = 'minute') %>%
             floor_date(unit = 'hour')
           ) %>% 
  select(Datetime, WaterFlow)
head(water1)
## # A tibble: 6 x 2
##   Datetime            WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-22 20:00:00     23.4 
## 2 2015-10-22 20:00:00     28.0 
## 3 2015-10-22 20:00:00     23.1 
## 4 2015-10-22 20:00:00     30.0 
## 5 2015-10-22 21:00:00      6.00
## 6 2015-10-22 21:00:00     15.9
water2 <- water2 %>%
    mutate(Datetime = as.POSIXct(`Date Time` * (60*60*24), origin = '1899-12-30') %>% round_date(unit = 'minute')) %>%
  select(Datetime, WaterFlow)
head(water2)
## # A tibble: 6 x 2
##   Datetime            WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-22 21:00:00      18.8
## 2 2015-10-22 22:00:00      43.1
## 3 2015-10-22 23:00:00      38.0
## 4 2015-10-23 00:00:00      36.1
## 5 2015-10-23 01:00:00      31.9
## 6 2015-10-23 02:00:00      28.2

Now that we have the date variable set to the hour, we can combine the two series and aggregate the Water Flow data by taking the mean within each hour.

water <- rbind(water1, water2)

water <- water %>%
  group_by(Datetime) %>%
  summarise(WaterFlow = mean(WaterFlow)) %>%
  as_tsibble(index = 'Datetime')

head(water)
## # A tsibble: 6 x 2 [1h] <?>
##   Datetime            WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-22 20:00:00      26.1
## 2 2015-10-22 21:00:00      18.8
## 3 2015-10-22 22:00:00      32.5
## 4 2015-10-22 23:00:00      23.1
## 5 2015-10-23 00:00:00      22.4
## 6 2015-10-23 01:00:00      23.6

Graphing the aggregated Water Flow time series data.

water %>%
  autoplot() +
  labs(title = 'Mean of Hourly Water Flow Data')
## Plot variable not specified, automatically selected `.vars = WaterFlow`

We perform a Box-Cox transformation to stabilize the variance of the data.

lambda <- water %>%
  features(WaterFlow, features = guerrero) %>%
  pull(lambda_guerrero)

water <- water %>%
  mutate(WaterFlow = box_cox(WaterFlow, lambda))

water %>%
  gg_tsdisplay() +
  labs(title = 'Box Cox Transformation of Hourly Mean Water Flow Data')
## Plot variable not specified, automatically selected `y = WaterFlow`
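
The assignment also asks whether the series is stationary. A quick check (a sketch; output not reproduced here) is the KPSS unit-root test via the unitroot_kpss feature, where a small p-value suggests differencing is needed:

# sketch: KPSS unit-root test; a kpss_pvalue below 0.05 points to non-stationarity
water %>%
  features(WaterFlow, unitroot_kpss)

The ARIMA model selected below uses one difference (d = 1), so the transformed series appears stationary after a single difference and can be forecast.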

Here we will create models using the Exponential Smoothing (ETS), ARIMA, and Seasonal Naive methods to forecast a one-week period. We will compare the results on the held-out data and choose the best performing model to predict WaterFlow one week ahead.

water_train <- water %>%
  filter_index(. ~ '2015-11-26') 

water_fit <- water_train %>%
  model(
    ETS = ETS(WaterFlow),
    ARIMA = ARIMA(WaterFlow),
    SNAIVE = SNAIVE(WaterFlow)
  ) 

water_fc <- water_fit %>% forecast(h = '1 week')

water_fc %>%
  autoplot(water_train, level = NULL) + 
  autolayer(
    filter_index(water, '2015-11-27' ~ .),
    color = 'black'
  ) +
  labs(
    title = 'Forecasts for Hourly Water Flow'
  ) +
  guides(color = guide_legend(title = 'Forecast'))
## Plot variable not specified, automatically selected `.vars = WaterFlow`

Taking a look at the models’ metrics to see their performance on the held-out data.

accuracy(water_fc, water)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 12 observations are missing between 2015-12-03 12:00:00 and 2015-12-03 23:00:00
## # A tibble: 3 x 10
##   .model .type      ME   RMSE    MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ARIMA  Test  0.00319 0.0846 0.0504 -0.344  4.16 0.956 0.898 0.0474
## 2 ETS    Test  0.00338 0.0846 0.0505 -0.330  4.17 0.957 0.898 0.0474
## 3 SNAIVE Test  0.0176  0.103  0.0689  0.732  5.49 1.31  1.10  0.0370

The generated ARIMA model performs the best when looking at all metrics, though the ETS model is nearly identical.

Using the Box-Pierce test on the residuals from each model, we will see which models have residuals that are indistinguishable from white noise.

water_fit %>%
  select(ARIMA) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 50, dof = 0)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     51.4     0.419
water_fit %>%
  select(ETS) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 50, dof = 0)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       51.2     0.425
water_fit %>%
  select(SNAIVE) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 50, dof = 0)
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 SNAIVE    291.         0

Since the ARIMA model performs the best of the models and its residuals are indistinguishable from white noise, we will use the ARIMA model to forecast the time series.

water_fit %>%
  select(ARIMA) %>%
  report()
## Series: WaterFlow 
## Model: ARIMA(0,1,1) 
## 
## Coefficients:
##           ma1
##       -0.9822
## s.e.   0.0063
## 
## sigma^2 estimated as 0.00456:  log likelihood=1076.03
## AIC=-2148.06   AICc=-2148.05   BIC=-2138.59
water_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals()

Producing the forecast with the ARIMA model, refit on the full series, and then inverting the Box-Cox transformation.

water_fit <- water %>%
  model(ARIMA(WaterFlow))

water_fc <- water_fit %>% forecast(h = '1 week')

water_fc %>%
  autoplot() + 
  autolayer(water)
## Plot variable not specified, automatically selected `.vars = WaterFlow`

water_fc_1week <- water_fc %>%
  as.data.frame() %>%
  select(Datetime, WaterFlow, .mean)

water_fc_1week$WaterFlow <- inv_box_cox(water_fc_1week$.mean, lambda)

water_fc_1week <- water_fc_1week %>% select(-.mean) 

Writing the forecasted period to an Excel file.

write_xlsx(water_fc_1week, 'WaterFlow 1 Week Forecast.xlsx')