Data 624 Project 1

9/24/2021

Gabe Abreu

Project 1

Part A - ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast, which you will put in an Excel-readable file.

Data Exploration and Data Cleaning

glimpse(raw_atm_data)
## Rows: 1,474
## Columns: 3
## $ DATE <dttm> 2009-05-01, 2009-05-01, 2009-05-02, 2009-05-02, 2009-05-03, 2009~
## $ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "~
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, ~

There are only 3 columns: DATE, ATM, and Cash. The DATE column is a dttm (POSIXct) type, ATM is character, and Cash is a double, and there are multiple ATMs to consider for forecasting. The main concern right now is converting the DATE column into a data type that can be used for a time series analysis.

Any data exploration should include a check for missing values (NAs).

Only a very small percentage of the ATM and Cash values are missing, and no dates are missing, which is very good for a time series analysis. We can easily impute or remove the missing data.
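
The code that produced the listing below is not shown in the rendered output; a minimal sketch, assuming dplyr is loaded:

#List every row with a missing ATM label or Cash value
raw_atm_data %>% filter(is.na(ATM) | is.na(Cash))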

## # A tibble: 19 x 3
##    DATE                ATM    Cash
##    <dttm>              <chr> <dbl>
##  1 2009-06-13 00:00:00 ATM1     NA
##  2 2009-06-16 00:00:00 ATM1     NA
##  3 2009-06-18 00:00:00 ATM2     NA
##  4 2009-06-22 00:00:00 ATM1     NA
##  5 2009-06-24 00:00:00 ATM2     NA
##  6 2010-05-01 00:00:00 <NA>     NA
##  7 2010-05-02 00:00:00 <NA>     NA
##  8 2010-05-03 00:00:00 <NA>     NA
##  9 2010-05-04 00:00:00 <NA>     NA
## 10 2010-05-05 00:00:00 <NA>     NA
## 11 2010-05-06 00:00:00 <NA>     NA
## 12 2010-05-07 00:00:00 <NA>     NA
## 13 2010-05-08 00:00:00 <NA>     NA
## 14 2010-05-09 00:00:00 <NA>     NA
## 15 2010-05-10 00:00:00 <NA>     NA
## 16 2010-05-11 00:00:00 <NA>     NA
## 17 2010-05-12 00:00:00 <NA>     NA
## 18 2010-05-13 00:00:00 <NA>     NA
## 19 2010-05-14 00:00:00 <NA>     NA

Looking at the missing values, there are 14 rows where both the ATM label and the cash amount are missing and 5 rows where only the cash amount is missing. Eliminating the rows without ATM and cash information seems prudent; imputing those cases would amount to making up data. The remaining missing values can be imputed with the imputeTS package and its na_mean function.

#Omit rows with NA Data
complete_atm_data <- na.omit(raw_atm_data)

As it stands, the data set has three columns. It is more convenient to reshape it into five columns: DATE plus one column per ATM, so that each ATM’s cash values are easier to work with.

#Change the data frame that shows the four ATMs and their respective cash values
complete_atm_data1 <- complete_atm_data %>% pivot_wider(names_from = ATM, values_from = Cash)

ATM1 and ATM2 have lower cash withdrawals; ATM1 is fairly normally distributed, while ATM2 is not and is more spread out. ATM3 has mostly zero withdrawals with a few outliers. ATM4 has much higher cash withdrawals and is strongly right-skewed.

complete_atm_data1%>%select(-DATE)%>%plot_histogram()

ggplot(complete_atm_data, aes(x=ATM, y=Cash, group=ATM)) + geom_boxplot(aes(fill=ATM))

Time Series Analysis

Since there are four ATMs, it is best to separate them into individual time series.

The DATE column is a POSIXct type that needs to be converted to a Date before the data frame can be converted to a tsibble.

complete_atm_data1$DATE <- as.Date(complete_atm_data1$DATE)

#Convert the wide data frame to a tsibble; this call is implied by the index message below
ts_catm <- complete_atm_data1 %>% as_tsibble()
## Using `DATE` as index variable.

#Select individual ATMs from the tsibble
ts_atm1 <- ts_catm %>% select(ATM1)
ts_atm2 <- ts_catm %>% select(ATM2)
ts_atm3 <- ts_catm %>% select(ATM3)
ts_atm4 <- ts_catm %>% select(ATM4)
ATM1

Quickly impute the NA values using the na_mean function from the imputeTS package.

ts_atm1_imp <- na_mean(ts_atm1)

Partition training data to evaluate the models

train1 <- ts_atm1_imp %>% filter_index(. ~'2010-02-28')
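#The plotting call is not shown in the rendered output; a minimal, hypothetical
#sketch assuming feasts' gg_tsdisplay, which emits the message below when the
#plot variable is not specified:
train1 %>% gg_tsdisplay(plot_type = 'partial')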
## Plot variable not specified, automatically selected `y = ATM1`

Based on the plot above, there is autocorrelation at lag 7, with spikes above the critical value range; the PACF also shows a sharp spike at lag 7. The series is clearly non-stationary and shows some seasonality, so the seasonality and trend need further examination. For that, let’s perform a decomposition.

Performing a decomposition, we can examine the individual components that contribute to the overall series.
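
The decomposition code is not shown in the rendered output; a minimal sketch, assuming an STL decomposition from the feasts package:

#STL decomposition of the ATM1 training series
train1 %>%
    model(STL(ATM1)) %>%
    components() %>%
    autoplot()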

Exponential Smoothing Model

The time series above is seasonal, but the variation does not increase with the level of the series; the seasonal variation is roughly constant, so the additive method is preferred. I would manually select an ETS(A,N,A) model since the trend is negligible. Let’s run the ETS function to see if the automated selection matches my thought process.

fit_ets <- train1 %>% model(ETS(ATM1))
report(fit_ets)
## Series: ATM1 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.06840623 
##     gamma = 0.0001000355 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]    s[-3]   s[-4]    s[-5]    s[-6]
##  81.03764 -64.65422 -7.532837 22.19303 4.194419 16.8054 13.49184 15.50238
## 
##   sigma^2:  586.7125
## 
##      AIC     AICc      BIC 
## 3686.699 3687.450 3723.869

augment(fit_ets) %>%
    features(.innov, ljung_box, lag = 16, dof = 6)
## # A tibble: 1 x 3
##   .model    lb_stat lb_pvalue
##   <chr>       <dbl>     <dbl>
## 1 ETS(ATM1)    27.6   0.00206

Despite the normally distributed residuals, the ACF and the Ljung-Box test indicate that the residuals are not white noise. The p-value is much smaller than 0.05, so we must reject the null hypothesis.

Arima

In order to create an Arima forecast, we have to make the data stationary. Given the visuals from above, we know that the data is seasonal, so let’s check the number of seasonal differences needed using the nsdiffs function.

As expected, ndiffs gives 0 for non-seasonal differencing.

ndiffs(train1$ATM1)
## [1] 0

There needs to be first order seasonal differencing, which is expected due to the seasonality of the data.

nsdiffs(ts(train1$ATM1, frequency = 7))
## [1] 1
fit_arima <- train1 %>% model(ARIMA(ATM1))
report(fit_arima)
## Series: ATM1 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.2004  -0.7469  -0.1728
## s.e.  0.0591   0.0621   0.0596
## 
## sigma^2 estimated as 581.3:  log likelihood=-1371.3
## AIC=2750.61   AICc=2750.74   BIC=2765.38

Validating the different models:

bind_rows(
    fit_arima %>% accuracy(),
    fit_ets %>% accuracy(),
    fit_arima %>% forecast(h = 60) %>% accuracy(ts_atm1_imp),
    fit_ets %>% forecast(h = 60) %>% accuracy(ts_atm1_imp)
  ) %>%
  select(-ME, -MPE, -ACF1)
## # A tibble: 4 x 7
##   .model      .type     RMSE   MAE  MAPE  MASE RMSSE
##   <chr>       <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(ATM1) Training  23.7  15.0  116. 0.784 0.796
## 2 ETS(ATM1)   Training  23.9  14.9  125. 0.776 0.801
## 3 ARIMA(ATM1) Test      44.0  32.9  377. 1.72  1.48 
## 4 ETS(ATM1)   Test      51.7  38.4  456. 2.00  1.74

The two models perform comparably on the training data, but the ARIMA model is clearly better on the test data: comparing RMSE, MAE, MAPE, MASE, and RMSSE, ARIMA outperforms ETS on every test metric.

Forecast ATM1
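
The forecast code for ATM1 is omitted from the rendered output, but atm_prod_forecast1 is used in the Excel export below. A minimal sketch mirroring the ATM4 code, assuming the chosen ARIMA model is refit on the full imputed series and forecast over the 31 days of May 2010:

#Refit ARIMA on the full ATM1 series and forecast May 2010
atm_prod1 <- ts_atm1_imp %>%
                model(ARIMA(ATM1))
atm_prod_forecast1 <- atm_prod1 %>% forecast(h=31)
atm_prod_forecast1 %>% autoplot(ts_atm1_imp)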


ATM 2
ts_atm2_imp <- na_mean(ts_atm2)
train2 <- ts_atm2_imp %>% filter_index(. ~'2010-02-28')
## Plot variable not specified, automatically selected `y = ATM2`

Much like ATM1, there is seasonality and the series is non-stationary.

Performing a decomposition, we can examine the individual components that contribute to the overall series.

Exponential Smoothing Model

The time series above is seasonal, but the variation does not increase with the level of the series; the seasonal variation is roughly constant, so the additive method is preferred. I would manually select an ETS(A,N,A) model since the trend is negligible. Let’s run the ETS function to see if the automated selection matches my thought process.

fit_ets2 <- train2 %>% model(ETS(ATM2))
report(fit_ets2)
## Series: ATM2 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.004507263 
##     gamma = 0.1531301 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]      s[-3]    s[-4]    s[-5]    s[-6]
##  71.32288 -59.57489 -44.09744 26.77243 -0.3679009 10.10128 24.97068 42.19585
## 
##   sigma^2:  643.0447
## 
##     AIC    AICc     BIC 
## 3714.57 3715.32 3751.74

augment(fit_ets2) %>%
    features(.innov, ljung_box, lag = 16, dof = 6)
## # A tibble: 1 x 3
##   .model    lb_stat lb_pvalue
##   <chr>       <dbl>     <dbl>
## 1 ETS(ATM2)    23.2   0.00998

Despite the normally distributed residuals, the ACF and the Ljung-Box test indicate that the residuals are not white noise. The p-value is much smaller than 0.05, so we must reject the null hypothesis.

Arima

In order to create an Arima forecast, we have to make the data stationary. Given the visuals from above, we know that the data is seasonal, so let’s check the number of seasonal differences needed using the nsdiffs function.

According to ndiffs, first-order non-seasonal differencing is suggested, unlike ATM1.

ndiffs(train2$ATM2)
## [1] 1

There needs to be first order seasonal differencing, which is expected due to the seasonality of the data.

nsdiffs(ts(train2$ATM2, frequency = 7))
## [1] 1
fit_arima2 <- train2 %>% model(ARIMA(ATM2))
report(fit_arima2)
## Series: ATM2 
## Model: ARIMA(0,0,0)(0,1,1)[7] 
## 
## Coefficients:
##          sma1
##       -0.7984
## s.e.   0.0477
## 
## sigma^2 estimated as 641.6:  log likelihood=-1384.37
## AIC=2772.74   AICc=2772.78   BIC=2780.13

I will quickly investigate whether adding first-order non-seasonal differencing produces better results than the automated model, which did not include any non-seasonal differencing.

fit_custom <- train2 %>% model(ARIMA(ATM2 ~ pdq(0,1,0) + PDQ(0,1,1)))
report(fit_custom)
## Series: ATM2 
## Model: ARIMA(0,1,0)(0,1,1)[7] 
## 
## Coefficients:
##          sma1
##       -0.7941
## s.e.   0.0575
## 
## sigma^2 estimated as 1241:  log likelihood=-1477.27
## AIC=2958.54   AICc=2958.59   BIC=2965.92

Adding non-seasonal differencing does not improve the results.
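
The validation code is omitted from the output; a minimal sketch mirroring the ATM1 comparison, assuming the same accuracy workflow:

#Compare ARIMA and ETS accuracy on the training and test data for ATM2
bind_rows(
    fit_arima2 %>% accuracy(),
    fit_ets2 %>% accuracy(),
    fit_arima2 %>% forecast(h = 60) %>% accuracy(ts_atm2_imp),
    fit_ets2 %>% forecast(h = 60) %>% accuracy(ts_atm2_imp)
  ) %>%
  select(-ME, -MPE, -ACF1)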

## # A tibble: 4 x 7
##   .model      .type     RMSE   MAE  MAPE  MASE RMSSE
##   <chr>       <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(ATM2) Training  25.0  17.1   Inf 0.783 0.798
## 2 ETS(ATM2)   Training  25.0  17.2   Inf 0.788 0.798
## 3 ARIMA(ATM2) Test      44.1  40.2   Inf 1.84  1.41 
## 4 ETS(ATM2)   Test      48.5  44.2   Inf 2.03  1.55

The ARIMA model performed essentially the same as ETS on the training data but noticeably better on the test data, outperforming ETS on RMSE, MAE, MASE, and RMSSE.
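
As with ATM1, the production forecast code for ATM2 is not shown, but atm_prod_forecast2 is used in the Excel export below. A minimal sketch under the same assumptions:

#Refit ARIMA on the full ATM2 series and forecast May 2010
atm_prod2 <- ts_atm2_imp %>%
                model(ARIMA(ATM2))
atm_prod_forecast2 <- atm_prod2 %>% forecast(h=31)
atm_prod_forecast2 %>% autoplot(ts_atm2_imp)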


ATM 3

This is a unique case: ATM3 has only a few non-zero data points, all at the tail end of the series. Perhaps the ATM was not in use until very recently, or was broken. Given only three non-zero observations, a sharp upward jump, and no seasonality to work with, neither ARIMA nor ETS would be accurate.

The average of the three non-zero points is 87.7, and the last observation is 85. A NAIVE forecast, which sets all future values equal to the last observation, seems appropriate here.

m_Fit <- ts_atm3 %>% model(NAIVE = NAIVE(ATM3))

mean_forecast<- m_Fit %>% forecast(h=31)

mean_forecast %>%
    autoplot(ts_atm3) 


ATM 4
gg_tsdisplay(ts_atm4, plot_type = 'partial')
## Plot variable not specified, automatically selected `y = ATM4`

There is a major spike in the series; apart from that outlier, the series appears roughly stationary.

The outlier is distorting the series and needs to be cleaned. Since the purpose of the forecast is to produce realistic daily values, the outlier should not be kept as-is.

Let’s find the location of the maximum value and replace it with the average of the remaining values.
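
The code for this step is not shown in the rendered output; a minimal, hypothetical sketch of locating the maximum and overwriting it with the mean of the remaining observations:

#Locate the outlier (the maximum withdrawal) in the ATM4 series
max_loc <- which.max(ts_atm4$ATM4)
max_loc
#Replace it with the average of the remaining observations
ts_atm4$ATM4[max_loc] <- mean(ts_atm4$ATM4[-max_loc])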

## [1] 285
train4 <- ts_atm4 %>% filter_index(. ~'2010-02-28')

This series differs from ATM1 and ATM2: trend and seasonality are not major contributors to the overall series.

Exponential Smoothing Model

The time series above is not really seasonal and the variation does not increase with the level. I would manually select an ETS(A,N,N) model since the trend is negligible. Let’s run the ETS function to see if the automated selection matches my thought process.

ETS Model

fit_ets4 <- train4 %>% model(ETS(ATM4))
report(fit_ets4)
## Series: ATM4 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0002989694 
##     gamma = 0.0001000218 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  455.7059 -334.8196 -75.13547 93.63183 33.80112 117.6202 51.78543 113.1164
## 
##   sigma^2:  111569.1
## 
##      AIC     AICc      BIC 
## 5282.050 5282.801 5319.220

The automated ETS selected ETS(A,N,A), which is relatively close to my expectation; however, I will also create an ETS(A,N,N) model since the seasonality appeared minimal in the decomposition above.

fit_ets5 <- train4 %>%
            model(ETS(ATM4 ~ error("A") +
                       trend("N") + season("N")))
report(fit_ets5)
## Series: ATM4 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.0001000255 
## 
##   Initial states:
##      l[0]
##  451.8845
## 
##   sigma^2:  131651.8
## 
##      AIC     AICc      BIC 
## 5325.496 5325.576 5336.647

augment(fit_ets4) %>%
    features(.innov, ljung_box, lag = 16, dof = 6)
## # A tibble: 1 x 3
##   .model    lb_stat lb_pvalue
##   <chr>       <dbl>     <dbl>
## 1 ETS(ATM4)    23.0    0.0109
augment(fit_ets5) %>%
    features(.innov, ljung_box, lag = 16, dof = 6)
## # A tibble: 1 x 3
##   .model                                                    lb_stat lb_pvalue
##   <chr>                                                       <dbl>     <dbl>
## 1 "ETS(ATM4 ~ error(\"A\") + trend(\"N\") + season(\"N\"))"    27.5   0.00213

The ETS(A,N,N) model’s residuals do not appear to be white noise and the histogram is skewed. The auto-generated model produces better validation results and will therefore be the chosen ETS model.

Arima

The data already appears to be stationary, but for the sake of consistency we will run the ndiffs and nsdiffs functions.

ndiffs(train4$ATM4)
## [1] 0
nsdiffs(ts(train4$ATM4, frequency = 7))
## [1] 0

The functions report that no differencing is necessary. Therefore, I expect a reasonable ARIMA model to include little or no differencing.

fit_arima4 <- train4 %>% model(ARIMA(ATM4))
report(fit_arima4)
## Series: ATM4 
## Model: ARIMA(0,0,0)(1,0,0)[7] w/ mean 
## 
## Coefficients:
##         sar1  constant
##       0.1798  371.1336
## s.e.  0.0575   20.3060
## 
## sigma^2 estimated as 127447:  log likelihood=-2217.3
## AIC=4440.6   AICc=4440.68   BIC=4451.75

The returned model is ARIMA(0,0,0)(1,0,0)[7] with mean, i.e., no differencing and only a weak seasonal AR(1) term. This is largely within expectations, since the functions above reported zero-order differencing for both the non-seasonal and seasonal components and the trend is negligible.

## # A tibble: 4 x 7
##   .model      .type     RMSE   MAE  MAPE  MASE RMSSE
##   <chr>       <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(ATM4) Training  356.  298.  542. 0.844 0.770
## 2 ETS(ATM4)   Training  329.  258.  463. 0.729 0.712
## 3 ARIMA(ATM4) Test      296.  254.  887. 0.718 0.641
## 4 ETS(ATM4)   Test      347.  297. 1035. 0.839 0.751

An ARIMA(0,0,0) model with only a weak seasonal AR term is close to a white-noise model. It performed slightly worse than the chosen ETS model on the training data but better on the test data.

Forecast
atm_prod4 <- ts_atm4 %>%
                model(ARIMA(ATM4))
atm_prod_forecast4 <- atm_prod4 %>% forecast(h=31)
atm_prod_forecast4 %>% autoplot(ts_atm4)

The forecast looks no better than a random walk or a MEAN forecast.

mean_Fit <- ts_atm4 %>% model(MEAN = MEAN(ATM4))

mean_forecast2<- mean_Fit %>% forecast(h=31)

mean_forecast2 %>%
    autoplot(ts_atm4) 

Given the random nature of the series, and judging from the plot, the values appear to revert to the mean. I’m inclined to move forward with the MEAN model rather than the ARIMA model. For a business use case, MEAN is easier to explain to a non-technical audience, and visually it makes more sense than explaining the nuances of ARIMA.

Excel Output
forecast_df <- tibble(date = atm_prod_forecast1$DATE, atm1 = atm_prod_forecast1$.mean, atm2 = atm_prod_forecast2$.mean, atm3 = mean_forecast$.mean, atm4 = mean_forecast2$.mean)
write.csv(forecast_df,"C:/Users/gabre/OneDrive/Desktop/D624/Project1PartA.csv", row.names = FALSE)

——————————————————————————————————————————————–

Part B - Power Forecast

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours; the rest is straightforward. Add this to your existing files above.

Data Exploration and Data Cleaning

glimpse(raw_power_data)
## Rows: 192
## Columns: 3
## $ CaseSequence <dbl> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74~
## $ `YYYY-MMM`   <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May~
## $ KWH          <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891~

There are only 3 columns: CaseSequence, ‘YYYY-MMM’, and KWH. The CaseSequence column is not relevant to the time series analysis and should be removed. The ‘YYYY-MMM’ column, which should hold the dates, is type ‘chr’, and KWH is dbl; ‘YYYY-MMM’ needs to be converted into the appropriate data type before any significant analysis can be performed.

Any data exploration should include a check for missing values (NAs).

ts_power <- raw_power_data %>% select(KWH) %>% ts(start = decimal_date(date("1998-01-01")), frequency = 12)

The autoplot of the power usage series shows a slight increase in variation with the level of the series, so a Box-Cox transformation should be beneficial. My initial impression is that there is also seasonality.

autoplot(ts_power)

Looking at the raw data with the complete.cases function, we can see that CaseSequence 861, September 2008, is missing its KWH value. We could take the average of previous Septembers to estimate this value. The rest of the series looks complete.
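
The check itself is not shown in the output; a minimal sketch using complete.cases as described:

#Show the rows of the raw power data that are not complete
raw_power_data[!complete.cases(raw_power_data), ]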

## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`   KWH
##          <dbl> <chr>      <dbl>
## 1          861 2008-Sep      NA

I will use imputeTS’s na_mean function to impute the missing value. As a side note, I tried imputing with the median and the mode and both changed the character of the plot: the mode created another outlier and the median was not a realistic value for September.

ts_imp_pwr <- na_mean(ts_power, option = 'mean')

The imputed value is 77, which is certainly within the acceptable range of values.

ggtsdisplay(ts_imp_pwr, plot_type = 'partial')
## Warning: Ignoring unknown parameters: plot_type

Looking at the ACF and PACF plots, there is seasonality and autocorrelation. For the ARIMA model we will need some form of differencing, and a Box-Cox transformation will be valuable. There is also a clear outlier that will affect the series.

The outlier could be a blackout or the result of a natural disaster. If the goal were to analyze overall past power usage, the outlier could stay, but since we are trying to predict typical future usage, I will replace it with a reasonable value.

Let’s find the location of the minimum value and replace it.
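
The code for this step is not shown in the rendered output; a minimal, hypothetical sketch (the exact replacement expression that produced the warning below is not reproduced here):

#Locate the outlier (the minimum value) and overwrite it with the mean of the remaining observations
min_loc <- which.min(ts_imp_pwr)
min_loc
ts_imp_pwr[min_loc] <- mean(ts_imp_pwr[-min_loc])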

## [1] 151
## Warning in NextMethod("[<-"): number of items to replace is not a multiple of
## replacement length
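
The definitions of power_lambda and pwr_bxcx used below are also not shown; a minimal sketch, assuming the lambda is chosen with forecast’s BoxCox.lambda (which defaults to the Guerrero method):

#Choose a Box-Cox lambda for the power series and create the transformed series
power_lambda <- BoxCox.lambda(ts_imp_pwr)
pwr_bxcx <- BoxCox(ts_imp_pwr, lambda = power_lambda)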

A quick plot of the series with all the transformations:

autoplot(ts_imp_pwr)

decompose(ts_imp_pwr) %>% autoplot()

ggtsdisplay(pwr_bxcx)

The Box-Cox transformation appears to have stabilized the variation in the series; however, the series is still non-stationary.

ts_imp_pwr <- ts_imp_pwr %>% as_tsibble()
train_pwr <- ts_imp_pwr %>% filter_index(. ~'2010-12')
test_pwr <- ts_imp_pwr %>% filter_index('2011-01'~.)

Examine the training data:

## Plot variable not specified, automatically selected `y = value`

There is seasonality evident in the series and values outside of the critical range, indicating a non-stationary series.

Exponential Smoothing Model

The series above is still seasonal, but after the transformation the variation can be considered consistent across the series and there is no clear trend; therefore I would produce an ETS(A,N,A) model.

ets_pwr <- train_pwr %>% model(ETS(box_cox(value, power_lambda)))
report(ets_pwr)
## Series: value 
## Model: ETS(M,N,A) 
## Transformation: box_cox(value, power_lambda) 
##   Smoothing parameters:
##     alpha = 0.06999936 
##     gamma = 0.0001000011 
## 
##   Initial states:
##      l[0]         s[0]       s[-1]        s[-2]       s[-3]       s[-4]
##  4.627449 -0.002678869 -0.01078707 -0.003477424 0.007360129 0.009686945
##        s[-5]         s[-6]        s[-7]        s[-8]        s[-9]      s[-10]
##  0.006884593 -4.838848e-05 -0.009278673 -0.006776184 -0.002053227 0.002725681
##       s[-11]
##  0.008442485
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -966.2103 -962.7818 -920.4625

The ETS function selected an ETS(M,N,A) model. I will also create an ETS(A,N,A) model and compare the two.

ets_pwr_manual <- train_pwr %>%
                model(ETS(box_cox(value, power_lambda) ~ error("A") +
                       trend("N") + season("A")))
report(ets_pwr_manual)
## Series: value 
## Model: ETS(A,N,A) 
## Transformation: box_cox(value, power_lambda) 
##   Smoothing parameters:
##     alpha = 0.07622603 
##     gamma = 0.0001009679 
## 
##   Initial states:
##      l[0]         s[0]       s[-1]        s[-2]       s[-3]       s[-4]
##  4.626121 -0.002339438 -0.01067408 -0.003604131 0.007040619 0.009586188
##        s[-5]         s[-6]       s[-7]        s[-8]        s[-9]      s[-10]
##  0.007373331 -0.0001179183 -0.00952559 -0.006779878 -0.002015573 0.002480494
##      s[-11]
##  0.00857598
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -965.6777 -962.2491 -919.9299

ETS(M, N, A) Residuals
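
The residual plot is not reproduced in the rendered output; a minimal sketch of the call, assuming feasts’ gg_tsresiduals:

#Residual diagnostics for the automatic ETS(M,N,A) fit
ets_pwr %>% gg_tsresiduals(lag_max = 16)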

ETS(A,N,A) Residuals
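
Likewise for the manual model; a minimal sketch:

#Residual diagnostics for the manual ETS(A,N,A) fit
ets_pwr_manual %>% gg_tsresiduals(lag_max = 16)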

augment(ets_pwr) %>%
    features(.innov, ljung_box, lag = 16, dof = 6)
## # A tibble: 1 x 3
##   .model                            lb_stat  lb_pvalue
##   <chr>                               <dbl>      <dbl>
## 1 ETS(box_cox(value, power_lambda))    41.7 0.00000833

Despite the normally distributed residuals, the ACF and the Ljung-Box test indicate that the residuals are not white noise. The p-value is much smaller than 0.05, so we must reject the null hypothesis.

Arima

In order to create an Arima forecast, we have to make the data stationary. Given the visuals from above, we know that the data is seasonal, so let’s check the number of seasonal differences needed using the nsdiffs function.

According to ndiffs, no non-seasonal differencing is needed.

ndiffs(train_pwr$value)
## [1] 0

There needs to be first order seasonal differencing, which is expected due to the seasonality of the data.

nsdiffs(ts(train_pwr$value, frequency = 12))
## [1] 1
pwr_arima <- train_pwr %>% model(ARIMA(box_cox(value, power_lambda )))
report(pwr_arima)
## Series: value 
## Model: ARIMA(1,0,1)(0,1,2)[12] w/ drift 
## Transformation: box_cox(value, power_lambda) 
## 
## Coefficients:
##           ar1     ma1     sma1    sma2  constant
##       -0.4586  0.6647  -0.8370  0.1797     5e-04
## s.e.   0.1973  0.1588   0.0915  0.0946     2e-04
## 
## sigma^2 estimated as 1.314e-05:  log likelihood=613.74
## AIC=-1215.48   AICc=-1214.86   BIC=-1197.66

The ARIMA function produced ARIMA(1,0,1)(0,1,2)[12] with drift. This is a reasonable model, since zero-order non-seasonal differencing and first-order seasonal differencing were expected.

pwr_arima %>% gg_tsresiduals(lag_max=16)

augment(pwr_arima) %>%
    features(.innov, ljung_box, lag = 16, dof = 6)
## # A tibble: 1 x 3
##   .model                              lb_stat lb_pvalue
##   <chr>                                 <dbl>     <dbl>
## 1 ARIMA(box_cox(value, power_lambda))    28.1   0.00174

The p-value suggests the residuals are not white noise.

Let’s explore slightly adjusted ARIMA models and also use the automated search option, which forgoes the stepwise procedure and performs a more exhaustive search. Since the ARIMA model with the Box-Cox transformation did not produce a large enough p-value, it is prudent to look for alternative models.

ar_pwr_fit <- train_pwr %>% model(
        arima111= ARIMA((box_cox(value, power_lambda ))~ pdq(1,0,1) + PDQ(1,1,1)),
        arimaN111 = ARIMA((box_cox(value, power_lambda )) ~ pdq(1,1,1) + PDQ(0,1,2)),
        stepwise = ARIMA((box_cox(value, power_lambda ))),
        search = ARIMA((box_cox(value, power_lambda )), stepwise=FALSE))
glance(ar_pwr_fit) %>% arrange(AICc)
## # A tibble: 4 x 8
##   .model       sigma2 log_lik    AIC   AICc    BIC ar_roots   ma_roots  
##   <chr>         <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <list>     <list>    
## 1 search    0.0000125    619. -1222. -1221. -1198. <cpl [4]>  <cpl [24]>
## 2 stepwise  0.0000131    614. -1215. -1215. -1198. <cpl [1]>  <cpl [25]>
## 3 arima111  0.0000132    613. -1215. -1214. -1197. <cpl [13]> <cpl [13]>
## 4 arimaN111 0.0000135    604. -1199. -1199. -1184. <cpl [1]>  <cpl [25]>

The search model, which skips the stepwise procedure and considers all possibilities, came up with the best model by AICc: ARIMA(4,0,0)(0,1,2) with drift.

ar_pwr_fit %>% 
   dplyr::select(search) %>%
    gg_tsresiduals()

The residuals of the search model look like white noise.

augment(ar_pwr_fit) %>%
  filter(.model=='search') %>%
  features(.innov, ljung_box, lag = 10, dof = 3)
## # A tibble: 1 x 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 search    12.6    0.0825

The p-value is large enough for us to fail to reject the null hypothesis; the residuals can be treated as white noise.

bind_rows(
    pwr_arima %>% accuracy(),
    ets_pwr %>% accuracy(),
    ets_pwr_manual %>% accuracy(),
    ar_pwr_fit %>% dplyr::select(search) %>% accuracy(),
    pwr_arima %>% forecast(h = 35) %>% accuracy(test_pwr),
    ets_pwr %>% forecast(h = 35) %>% accuracy(test_pwr),
    ets_pwr_manual %>% forecast(h=35) %>% accuracy(test_pwr),
    ar_pwr_fit %>% dplyr::select(search) %>% forecast(h=35) %>% accuracy(test_pwr)
  ) %>%
  select(-ME, -MPE, -ACF1)
## # A tibble: 8 x 7
##   .model                             .type     RMSE    MAE  MAPE    MASE   RMSSE
##   <chr>                              <chr>    <dbl>  <dbl> <dbl>   <dbl>   <dbl>
## 1 "ARIMA(box_cox(value, power_lambd~ Traini~ 5.59e5 4.46e5  7.13   0.770   0.764
## 2 "ETS(box_cox(value, power_lambda)~ Traini~ 5.54e5 4.16e5  6.51   0.717   0.758
## 3 "ETS(box_cox(value, power_lambda)~ Traini~ 5.56e5 4.11e5  6.43   0.710   0.761
## 4 "search"                           Traini~ 5.49e5 4.41e5  7.01   0.761   0.751
## 5 "ARIMA(box_cox(value, power_lambd~ Test    1.01e6 7.53e5  9.38 NaN     NaN    
## 6 "ETS(box_cox(value, power_lambda)~ Test    1.08e6 8.59e5 10.9  NaN     NaN    
## 7 "ETS(box_cox(value, power_lambda)~ Test    1.07e6 8.48e5 10.7  NaN     NaN    
## 8 "search"                           Test    1.02e6 7.63e5  9.50 NaN     NaN

The non-search (stepwise) ARIMA performed slightly better on the test set than the search model, but I will elect to use the search model because of its residuals.

Forecast 2014

I will take two approaches for this forecast: one using the chosen test model and one trained on the entire data series. Using the entire series changes the auto-generated model. The training/testing above helped narrow the decision from considering both ETS and ARIMA to ARIMA only, and it validated the use of the Box-Cox transformation and the need for first-order seasonal differencing.

Also, after recreating the above steps on the full series, the search model is still the better of the two ARIMA models, and it produces residuals that can be considered white noise.

Forecast with Chosen Testing Model

pwr_forecast <- ar_pwr_fit %>%
                dplyr::select(search) %>%
                forecast(h=48) %>%
                autoplot(ts_imp_pwr)
pwr_forecast

Using the entire series for production, the auto-generated model changes from the old model, ARIMA(1,0,1)(0,1,2)[12] w/ drift, to the new model, ARIMA(1,0,0)(2,1,0)[12] w/ drift.

pwr_arima1 <- ts_imp_pwr %>% model(search = ARIMA((box_cox(value, power_lambda )), stepwise=FALSE))
pwr_arima1 %>% forecast(h=12) %>% autoplot(ts_imp_pwr)

Examining the two sets of values, the test model produces consistently lower forecasts than the production model.

test_values <- ar_pwr_fit %>%
                dplyr::select(search) %>%
                forecast(h=48)
prod_values <- pwr_arima1 %>% forecast(h=12) 
Excel Output
forecast2_df <- tibble(date = prod_values$index, KWH_Prod = prod_values$.mean)
write.csv(forecast2_df,"C:/Users/gabre/OneDrive/Desktop/D624/Project1PartB.csv", row.names = FALSE)

Sources:
https://www.statology.org/r-list-object-cannot-be-coerced-to-type-double/
https://steffenmoritz.github.io/imputeTS/
https://mran.microsoft.com/snapshot/2017-12-11/web/packages/imputeTS/vignettes/imputeTS-Time-Series-Missing-Value-Imputation-in-R.pdf