Data 624 Project 1

library(forecast)
library(fpp3)
library(ggplot2)
library(knitr)
library(openxlsx)
library(readxl)
library(seasonal)
library(tidyverse)
library(VIM)

Prompt A

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

Approach

We’ve been tasked with analyzing historical ATM data and forecasting what the next month of transactions could look like. As with any good analysis, we must start off by getting familiar with the data. Afterwards we address any issues we may have found in the data that could impede a smooth analysis. Then, once prepared, we begin testing different forecast models to try and see which one fits best. Finally, we use the best fit model to forecast the solicited data and present it to our new boss.

Data Exploration and Preparation

We begin by importing and loading our desired data set and running some summary functions to get familiar with it. We then make any necessary fixes and transformations to prepare it for analysis.

Load and View Data

atm_raw <- read.csv("https://raw.githubusercontent.com/Stevee-G/Data624/refs/heads/main/Project1/ATMData.csv")

str(atm_raw)

## 'data.frame':    1474 obs. of  3 variables:
##  $ DATE: chr  "5/1/2009 12:00:00 AM" "5/1/2009 12:00:00 AM" "5/2/2009 12:00:00 AM" "5/2/2009 12:00:00 AM" ...
##  $ ATM : chr  "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: int  96 107 82 89 85 90 90 55 99 79 ...

head(atm_raw)

##                   DATE  ATM Cash
## 1 5/1/2009 12:00:00 AM ATM1   96
## 2 5/1/2009 12:00:00 AM ATM2  107
## 3 5/2/2009 12:00:00 AM ATM1   82
## 4 5/2/2009 12:00:00 AM ATM2   89
## 5 5/3/2009 12:00:00 AM ATM1   85
## 6 5/3/2009 12:00:00 AM ATM2   90

summary(atm_raw)

##      DATE               ATM                 Cash        
##  Length:1474        Length:1474        Min.   :    0.0  
##  Class :character   Class :character   1st Qu.:    0.5  
##  Mode  :character   Mode  :character   Median :   73.0  
##                                        Mean   :  155.6  
##                                        3rd Qu.:  114.0  
##                                        Max.   :10920.0  
##                                        NA's   :19

atm_raw %>% 
  filter(is.na(Cash))

##                     DATE  ATM Cash
## 1  6/13/2009 12:00:00 AM ATM1   NA
## 2  6/16/2009 12:00:00 AM ATM1   NA
## 3  6/18/2009 12:00:00 AM ATM2   NA
## 4  6/22/2009 12:00:00 AM ATM1   NA
## 5  6/24/2009 12:00:00 AM ATM2   NA
## 6   5/1/2010 12:00:00 AM        NA
## 7   5/2/2010 12:00:00 AM        NA
## 8   5/3/2010 12:00:00 AM        NA
## 9   5/4/2010 12:00:00 AM        NA
## 10  5/5/2010 12:00:00 AM        NA
## 11  5/6/2010 12:00:00 AM        NA
## 12  5/7/2010 12:00:00 AM        NA
## 13  5/8/2010 12:00:00 AM        NA
## 14  5/9/2010 12:00:00 AM        NA
## 15 5/10/2010 12:00:00 AM        NA
## 16 5/11/2010 12:00:00 AM        NA
## 17 5/12/2010 12:00:00 AM        NA
## 18 5/13/2010 12:00:00 AM        NA
## 19 5/14/2010 12:00:00 AM        NA

As can be seen above, there are a few issues already. The DATE variable is in character format, rendering useless as a time series, and there are both missing values and blank rows. Below, we address this.

Address Formatting and Missing Values

atm_raw$DATE <- str_extract(atm_raw$DATE, "^[^ ]+") %>%
  as.Date(format = "%m/%d/%Y")

str(atm_raw)

## 'data.frame':    1474 obs. of  3 variables:
##  $ DATE: Date, format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr  "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: int  96 107 82 89 85 90 90 55 99 79 ...

atm_filtered <- atm_raw %>% 
  filter(ATM != "")

atm_wider <- atm_filtered %>% 
  pivot_wider(names_from = ATM,
              values_from = Cash)
str(atm_wider)

## tibble [365 × 5] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:365], format: "2009-05-01" "2009-05-02" ...
##  $ ATM1: int [1:365] 96 82 85 90 99 88 8 104 87 93 ...
##  $ ATM2: int [1:365] 107 89 90 55 79 19 2 103 107 118 ...
##  $ ATM3: int [1:365] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ATM4: int [1:365] 777 524 793 908 53 52 55 559 904 879 ...

atm_imputed <- kNN(atm_wider, k = 5) %>% 
  select(c(1:5))
sum(is.na(atm_imputed))

## [1] 0

summary(atm_imputed)

##       DATE                 ATM1             ATM2             ATM3        
##  Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000  
##  1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 25.00   1st Qu.: 0.0000  
##  Median :2009-10-30   Median : 91.00   Median : 66.00   Median : 0.0000  
##  Mean   :2009-10-30   Mean   : 83.99   Mean   : 62.32   Mean   : 0.7205  
##  3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000  
##  Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :96.0000  
##       ATM4      
##  Min.   :    2  
##  1st Qu.:  124  
##  Median :  404  
##  Mean   :  474  
##  3rd Qu.:  705  
##  Max.   :10920

ggplot(atm_imputed %>%
         pivot_longer(cols = c(ATM1, ATM2, ATM3, ATM4),
                      names_to = "ATM",
                      values_to = "Cash"),
       aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "ATM Cashflow")

As shown above, we addressed the DATE variable format conflict by transforming it into an official date variable. We then filtered out the rows with blank ATM entries as these cannot be used for analysis. Afterwards, we pivot the data frame wider in order to have all cash values correspond to their respective ATM. We then impute the remaining empty Cash cells by using the K-Nearest Neighbor method looking towards up to 5 neighbor cells. The summary and plot of the outcome can be seen above.

Prepare Tsibbles

atm1 <- select(atm_imputed, c(DATE, ATM1)) %>% 
  as_tsibble(index = DATE)
head(atm1)

## # A tsibble: 6 x 2 [1D]
##   DATE        ATM1
##   <date>     <int>
## 1 2009-05-01    96
## 2 2009-05-02    82
## 3 2009-05-03    85
## 4 2009-05-04    90
## 5 2009-05-05    99
## 6 2009-05-06    88

autoplot(atm1) +
  labs(title = "ATM1 Cashflow")

atm2 <- select(atm_imputed, c(DATE, ATM2)) %>% 
  as_tsibble(index = DATE)
head(atm2)

## # A tsibble: 6 x 2 [1D]
##   DATE        ATM2
##   <date>     <int>
## 1 2009-05-01   107
## 2 2009-05-02    89
## 3 2009-05-03    90
## 4 2009-05-04    55
## 5 2009-05-05    79
## 6 2009-05-06    19

autoplot(atm2) +
  labs(title = "ATM2 Cashflow")

atm3 <- select(atm_imputed, c(DATE, ATM3)) %>% 
  as_tsibble(index = DATE)
head(atm3)

## # A tsibble: 6 x 2 [1D]
##   DATE        ATM3
##   <date>     <int>
## 1 2009-05-01     0
## 2 2009-05-02     0
## 3 2009-05-03     0
## 4 2009-05-04     0
## 5 2009-05-05     0
## 6 2009-05-06     0

autoplot(atm3) +
  labs(title = "ATM3 Cashflow")

atm4 <- select(atm_imputed, c(DATE, ATM4)) %>% 
  as_tsibble(index = DATE)
head(atm4)

## # A tsibble: 6 x 2 [1D]
##   DATE        ATM4
##   <date>     <int>
## 1 2009-05-01   777
## 2 2009-05-02   524
## 3 2009-05-03   793
## 4 2009-05-04   908
## 5 2009-05-05    53
## 6 2009-05-06    52

autoplot(atm4) +
  labs(title = "ATM4 Cashflow")

Above we prepared individual time series tables, one for each ATM, in order to better analyze the patterns of each. Their respective plots are also displayed.

ATM 1

gg_tsdisplay(atm1)

lambda_atm1 <- atm1 %>%
  features(ATM1, features = guerrero) %>% 
  pull(lambda_guerrero)
lambda_atm1

## [1] 0.2395458

afit_atm1 <- auto.arima(atm1, seasonal = TRUE)
afit_atm1

## Series: atm1 
## ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.1671  -0.5893  -0.0962
## s.e.  0.0558   0.0508   0.0497
## 
## sigma^2 = 564.3:  log likelihood = -1642.57
## AIC=3293.14   AICc=3293.25   BIC=3308.66

checkresiduals(afit_atm1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,2)[7]
## Q* = 19.089, df = 11, p-value = 0.05951
## 
## Model df: 3.   Total lags used: 14

tfit_atm1 <- Arima(atm1$ATM1, order = c(0, 0, 1), seasonal = c(0, 1, 2), lambda = lambda_atm1)
fc_atm1 <- tfit_atm1 %>% 
  forecast(h = 31)

autoplot(fc_atm1) +
  labs(title = "ATM1 ARIMA Forecast")

mam_fit_a1 <- atm1 %>%
  model(MAM = ETS(box_cox(ATM1, lambda_atm1) ~ error('M') + trend('A') + season('M')))
mam_fc_a1 <- mam_fit_a1 %>% 
  forecast(h = 31)

mam_fc_a1 %>% 
  autoplot(atm1) +
  labs(title = "ATM1 ETS Forecast")

ATM1 displays seasonal behavior and does not contain any abnormalities in its entries. Above we get a sense of its distribution, variation, and residuals using the gg_tsdisplay() function. We start off the analysis by calculating both lambda and ARIMA fit values in order to understand which direction we should go in. As can be seen, we were suggested an ARIMA (0,0,1)(0,1,2) model. Using what we know, we went ahead and performed the forecast and obtained the results shown. Unfortunately, the ARIMA model did not provide such an ideal forecast. For this reason, we took a second approach and looked towards ETS modeling. Given the features of the time series, an MAM ETS model was fit and used to forecast. The results can be seen directly the results for the ARIMA model.

ATM 2

gg_tsdisplay(atm2)

lambda_atm2 <- atm2 %>%
  features(ATM2, features = guerrero) %>% 
  pull(lambda_guerrero)
lambda_atm2

## [1] 0.6691865

afit_atm2 <- auto.arima(atm2)
afit_atm2

## Series: atm2 
## ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4237  -0.9118  0.4594  0.7985  -0.7488
## s.e.   0.0548   0.0423  0.0849  0.0576   0.0388
## 
## sigma^2 = 585.9:  log likelihood = -1648.6
## AIC=3309.2   AICc=3309.44   BIC=3332.49

checkresiduals(afit_atm2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 8.9793, df = 9, p-value = 0.4392
## 
## Model df: 5.   Total lags used: 14

tfit_atm2 <- Arima(atm2$ATM2, order = c(2, 0, 2), seasonal = c(0, 1, 1), lambda = lambda_atm2)
fc_atm2 <- tfit_atm2 %>% 
  forecast(h = 31)

autoplot(fc_atm2) +
  labs(title = "ATM2 ARIMA Forecast")

madm_fit_a2 <- atm2 %>%
  model(MAdM = ETS(box_cox(ATM2, lambda_atm2) ~ error('M') + trend('Ad') + season('M')))
madm_fc_a2 <- madm_fit_a2 %>% 
  forecast(h = 31)

madm_fc_a2 %>% 
  autoplot(atm2) +
  labs(title = "ATM2 ETS Forecast")

ATM2 also displays seasonal behavior. However, its variability also seems to dip as time goes on. For this reason, it was especially necessary to consider the appropriate lambda to address this decrease in variance. As with ATM1, we get a sense of its distribution, variation, and residuals using the gg_tsdisplay() function. Just like before, we again calculate both lambda and ARIMA fit values in order to understand which direction we should go in. As can be seen, we were suggested an ARIMA (2,0,2)(0,1,1) model. Using this, we went ahead and performed the forecast and obtained the results shown. Sadly, yet again, the ARIMA model did not provide such an ideal forecast. Therefore, we once again looked towards ETS modeling. Given the features of the time series, an MAdM ETS model was fit and used to forecast. The results can be seen directly the results for the ARIMA model.

ATM 3

gg_tsdisplay(atm3)

lambda_atm3 <- atm3 %>%
  features(ATM3, features = guerrero) %>% 
  pull(lambda_guerrero)
lambda_atm3

## [1] 1.365421

afit_atm3 <- auto.arima(atm3)
afit_atm3

## Series: atm3 
## ARIMA(0,0,2) with zero mean 
## 
## Coefficients:
##          ma1     ma2
##       0.8392  0.8557
## s.e.  0.0496  0.0611
## 
## sigma^2 = 25.4:  log likelihood = -1108.69
## AIC=2223.39   AICc=2223.46   BIC=2235.09

checkresiduals(afit_atm3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2) with zero mean
## Q* = 0.1322, df = 12, p-value = 1
## 
## Model df: 2.   Total lags used: 14

tfit_atm3 <- Arima(atm3$ATM3, order = c(2, 0, 2), include.mean = FALSE, lambda = lambda_atm3)
fc_atm3 <- tfit_atm3 %>% 
  forecast(h = 31)

autoplot(fc_atm3) +
  labs(title = "ATM3 ARIMA Forecast")

ses_fit_a3 <- atm3 %>%
  model(SES = ETS(box_cox(ATM3, lambda_atm3) ~ error('A') + trend('N') + season('N')))
ses_fc_a3 <- ses_fit_a3 %>% 
  forecast(h = 31)

ses_fc_a3 %>% 
  autoplot(atm3) +
  labs(title = "ATM3 ETS Forecast")

With ATM3, we will continue what we’ve already done with ATM1 and ATM2. ATM3 does not show any seasonal behavior or anything we could use to identify a pattern since the majority of Cash values are 0 up until the very last 3 entries. Seeing this would suggest that we could get away with just performing a NAIVE or SNAIVE model analysis with the final entries, but we will continue to approach as previously. As with ATM1, we get a sense of its distribution, variation, and residuals using the gg_tsdisplay() function. Just like before, we again calculate both lambda and ARIMA fit values in order to understand which direction we should go in. As can be seen, we were suggested an ARIMA (0,0,2) model with zero mean. Using this, we went ahead and performed the forecast and obtained the results shown. Once again the ARIMA model did not provide a good forecast so we fitted an ETS model. Given the features of the time series, an SES ETS model was fit and used to forecast. The results can be seen directly the results for the ARIMA model.

ATM 4

atm4 %>% 
  sort_by(atm4$ATM4, decreasing = TRUE)

## # A tsibble: 365 x 2 [1D]
##    DATE        ATM4
##    <date>     <int>
##  1 2010-02-09 10920
##  2 2009-09-22  1712
##  3 2010-01-29  1575
##  4 2009-09-04  1495
##  5 2009-06-09  1484
##  6 2009-09-05  1301
##  7 2010-02-28  1276
##  8 2009-08-23  1246
##  9 2009-06-14  1221
## 10 2009-09-29  1195
## # ℹ 355 more rows

atm4$ATM4[which.max(atm4$ATM4)] <- median(atm4$ATM4, na.rm = TRUE)

atm4 %>% 
  sort_by(atm4$ATM4, decreasing = TRUE)

## # A tsibble: 365 x 2 [1D]
##    DATE        ATM4
##    <date>     <int>
##  1 2009-09-22  1712
##  2 2010-01-29  1575
##  3 2009-09-04  1495
##  4 2009-06-09  1484
##  5 2009-09-05  1301
##  6 2010-02-28  1276
##  7 2009-08-23  1246
##  8 2009-06-14  1221
##  9 2009-09-29  1195
## 10 2009-06-12  1191
## # ℹ 355 more rows

gg_tsdisplay(atm4)

lambda_atm4 <- atm4 %>%
  features(ATM4, features = guerrero) %>% 
  pull(lambda_guerrero)
lambda_atm4

## [1] 0.4524373

afit_atm4 <- auto.arima(atm4)
afit_atm4

## Series: atm4 
## ARIMA(0,0,3)(1,0,0)[7] with non-zero mean 
## 
## Coefficients:
##          ma1     ma2     ma3    sar1      mean
##       0.0875  0.0162  0.0886  0.1838  444.5817
## s.e.  0.0532  0.0583  0.0516  0.0526   26.1028
## 
## sigma^2 = 119373:  log likelihood = -2648.96
## AIC=5309.92   AICc=5310.15   BIC=5333.32

checkresiduals(afit_atm4)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,3)(1,0,0)[7] with non-zero mean
## Q* = 16.207, df = 10, p-value = 0.09385
## 
## Model df: 4.   Total lags used: 14

tfit_atm4 <- Arima(atm4$ATM4, order = c(0, 0, 3), seasonal = c(1, 0, 0), include.mean = TRUE, lambda = lambda_atm4)
fc_atm4 <- tfit_atm4 %>% 
  forecast(h = 31)

autoplot(fc_atm4) +
  labs(title = "ATM4 ARIMA Forecast")

mam_fit_a4 <- atm4 %>%
  model(MAM = ETS(box_cox(ATM4, lambda_atm4) ~ error('M') + trend('A') + season('M')))
mam_fc_a4 <- mam_fit_a4 %>% 
  forecast(h = 31)

mam_fc_a4 %>% 
  autoplot(atm4) +
  labs(title = "ATM4 ETS Forecast")

ATM4 also displays seasonal behavior. However, it seems to contain an overtly obvious outlier. We deal with this be first identifying how many outliers there are. Doing this we can see there is only one. We addrress this outlier by replacing it with the mean for the data set. We once again check to see if it worked and everything looked fine. We continue on to get a sense of its distribution, variation, and residuals using the gg_tsdisplay() function. Just like before, we calculate both lambda and ARIMA fit values in order to understand which direction we should go in. As can be seen, we were suggested an ARIMA (0,0,3)(1,0,0) model with non-zero mean. Using this, we went ahead and performed the forecast and obtained the results shown. Once again the ARIMA model did not provide a good forecast so we fitted an ETS model. Given the features of the time series, an MAM ETS model was fit and used to forecast. The results can be seen directly the results for the ARIMA model.

Prepare and Export ATM Forecast Data

data_frame(DATE = seq(as.Date("2010-05-01"), by ="day", length.out = 31),
           ATM1 = mam_fc_a1$.mean, 
           ATM2 = madm_fc_a2$.mean,
           ATM3 = ses_fc_a3$.mean,
           ATM4 = mam_fc_a4$.mean**2) %>%
  gather(key = ATM, value = Cash, -DATE) %>% 
  write.xlsx("ATM_Forecasts.xlsx")

Finally, we take the ETS forecast results and export them as one table to an excel spreadsheet.

Prompt B

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Approach

For this portion, we’ve been tasked with analyzing historical residential power data and forecasting what the next year of usage could look like. Just as with the previous analysis, we start off by getting familiar with the data. We then address any issues we may have found in the data that could impede a smooth analysis. Then, we begin testing different forecast models to try and see which one fits best. Finally, we use the best fit model to forecast the solicited data and present it to our new boss.

Data Exploration and Preparation

Once again, we begin by importing and loading our desired data set and running some summary functions to get familiar with it. We then make any necessary fixes and transformations to prepare it for analysis.

Load and View Data

pwr_raw <- read.csv("https://raw.githubusercontent.com/Stevee-G/Data624/refs/heads/main/Project1/PowerData.csv")

str(pwr_raw)

## 'data.frame':    192 obs. of  3 variables:
##  $ CaseSequence: int  733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY.MMM    : chr  "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
##  $ KWH         : int  6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428 6989888 6345620 ...

head(pwr_raw)

##   CaseSequence YYYY.MMM     KWH
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147

summary(pwr_raw)

##   CaseSequence     YYYY.MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

pwr_raw %>% 
  filter(is.na(KWH))

##   CaseSequence YYYY.MMM KWH
## 1          861 2008-Sep  NA

As can be seen above, the 2 main issues with this data set are its poorly formatted date variable, YYYY.MMM, and one missing value. We make sure to address these discrepancies below.

Address Formatting and Missing Value

pwr_raw$YYYY.MMM <- yearmonth(pwr_raw$YYYY.MMM)

str(pwr_raw)

## 'data.frame':    192 obs. of  3 variables:
##  $ CaseSequence: int  733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY.MMM    : mth [1:192] 1998 Jan, 1998 Feb, 1998 Mar, 1998 Apr, 1998 May, 1998 Jun...
##  $ KWH         : int  6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428 6989888 6345620 ...

pwr_imputed <- kNN(pwr_raw, k = 5) %>% 
  select(c(1:3))
sum(is.na(pwr_imputed))

## [1] 0

head(pwr_imputed)

##   CaseSequence YYYY.MMM     KWH
## 1          733 1998 Jan 6862583
## 2          734 1998 Feb 5838198
## 3          735 1998 Mar 5420658
## 4          736 1998 Apr 5010364
## 5          737 1998 May 4665377
## 6          738 1998 Jun 6467147

ggplot(pwr_imputed, aes(x = YYYY.MMM, y = KWH)) +
  geom_line() +
  labs(title = "KWH per Month")

As shown above, we addressed the YYYY.MMM variable formatting by transforming it into an official date variable using the yearmonth function. We then imputed the empty KWH cell using the K-Nearest Neighbor method looking towards up to 5 neighbor cells. A glimpse and plot of the outcome can be seen above.

Prepare Tsibble

power <- pwr_imputed %>% 
  select(-CaseSequence) %>% 
  as_tsibble(index = YYYY.MMM)
head(power)

## # A tsibble: 6 x 2 [1M]
##   YYYY.MMM     KWH
##      <mth>   <int>
## 1 1998 Jan 6862583
## 2 1998 Feb 5838198
## 3 1998 Mar 5420658
## 4 1998 Apr 5010364
## 5 1998 May 4665377
## 6 1998 Jun 6467147

autoplot(power) +
  labs(title = "KWH per Month")

Above we performed a simple tsibble transformation to format the data set into an analyzable time series.

Power

gg_tsdisplay(power)

lambda_p <- power %>%
  features(KWH, features = guerrero) %>% 
  pull(lambda_guerrero)
lambda_p

## [1] 0.107132

afit_p <- auto.arima(power, seasonal = TRUE)
afit_p

## Series: power 
## ARIMA(0,0,1)(1,1,1)[12] with drift 
## 
## Coefficients:
##          ma1     sar1     sma1     drift
##       0.2440  -0.1519  -0.7310  7614.784
## s.e.  0.0723   0.0963   0.0821  1954.202
## 
## sigma^2 = 7.468e+11:  log likelihood = -2719.93
## AIC=5449.85   AICc=5450.2   BIC=5465.82

checkresiduals(afit_p)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(1,1,1)[12] with drift
## Q* = 14.261, df = 21, p-value = 0.8581
## 
## Model df: 3.   Total lags used: 24

tfit_p <- Arima(power$KWH, order = c(0, 0, 1), seasonal = c(1, 1, 1), lambda = lambda_p)
fc_p <- tfit_p %>% 
  forecast(h = 12)

autoplot(fc_p) +
  labs(title = "KWH ARIMA Forecast")

mam_fit_p <- power %>%
  model(MAM = ETS(box_cox(KWH, lambda_p) ~ error('M') + trend('A') + season('M')))
mam_fc_p <- mam_fit_p %>% 
  forecast(h = 12)

mam_fc_p %>% 
  autoplot(power) +
  labs(title = "KWH ETS Forecast")

The power usage data displays seasonal behavior and does not contain any abnormalities in its entries. Above we get a sense of its distribution, variation, and residuals using the gg_tsdisplay() function. We start off the analysis by calculating both lambda and ARIMA fit values in order to understand which direction we should go in. As can be seen, we were suggested an ARIMA (0,0,1)(0,1,1) model. Using what we know, we went ahead and performed the forecast and obtained the results shown. Unfortunately, and yet again, the ARIMA model did not provide such an ideal forecast. Therefore, we once again looked towards ETS modeling. Given the features of the time series, an MAM ETS model was fit and used to forecast. The results can be seen directly the results for the ARIMA model.

Prepare and Export KWH Forecast Data

data_frame(`YYYY.MMM` = paste0(2014, "-", month.abb), KWH = mam_fc_p$.mean) %>% 
  write.xlsx("Power_KWH_Forecasts.xlsx")

Finally, we take the ETS forecast result and export it to an excel spreadsheet.

Data 624 Project 1

Steven Gonzalez

3/30/2025

Prompt A

Approach

Data Exploration and Preparation

Load and View Data

Address Formatting and Missing Values

Prepare Tsibbles

ATM 1

ATM 2

ATM 3

ATM 4

Prepare and Export ATM Forecast Data

Prompt B

Approach

Data Exploration and Preparation

Load and View Data

Address Formatting and Missing Value

Prepare Tsibble

Power

Prepare and Export KWH Forecast Data