library(fpp2) #in case of fpp3 package issues
library(fpp3)
library(forecast)
library(ggplot2)
library(naniar) #visualize missing data
library(inspectdf) #high level histogram
library(skimr) #high level data statistics
library(car) #cook's distance (outlier handling)
library(imputeTS) #impute missing time series values
library(TSstudio) #plot time series
library(astsa) #(S)ARIMA

Background

The purpose of our first project is to explore Time Series, Decomposition, Forecasting, Data Preprocessing / Overfitting, Exponential Smoothing, and ARIMA, and to see if / where we might apply the concepts we’ve covered up to this point, given different data sets and asks.


Part A - ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is provided in a single Excel file. The variable ‘Cash’ is provided in hundreds of dollars. Other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling.

  • Explain and demonstrate your process, techniques used and not used, and your actual forecast.
  • Please provide a written report of your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file.
  • In addition, please submit the forecast as an Excel readable file.

To start, we read in our data, specifying column types, bringing column names into a consistent (lowercase) form, converting the date column to a date object, ordering observations by date and ATM number, and finally converting all atm entries to lowercase.

The resulting first 6 entries (head), last 6 entries (tail), and summary statistics follow:

#read in the Excel file using read_excel()
atm_data <- readxl::read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric")) %>%
    rename(date=DATE, atm=ATM, cash=Cash) %>% #column names --> consistent form
    mutate(date = as_date(date)) %>% #convert object to date
    arrange(date, atm) #sort based on date, atm #

atm_data$atm <- tolower(atm_data$atm) #convert atm entries to lower case

head(atm_data)
## # A tibble: 6 x 3
##   date       atm    cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 atm1    96 
## 2 2009-05-01 atm2   107 
## 3 2009-05-01 atm3     0 
## 4 2009-05-01 atm4   777.
## 5 2009-05-02 atm1    82 
## 6 2009-05-02 atm2    89
tail(atm_data)
## # A tibble: 6 x 3
##   date       atm    cash
##   <date>     <chr> <dbl>
## 1 2010-05-09 <NA>     NA
## 2 2010-05-10 <NA>     NA
## 3 2010-05-11 <NA>     NA
## 4 2010-05-12 <NA>     NA
## 5 2010-05-13 <NA>     NA
## 6 2010-05-14 <NA>     NA
summary(atm_data)
##       date                atm                 cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

From the above output we can see that:

  • the first 6 entries show that our data has been properly sorted by date and ATM number
  • the last 6 entries are all NA and indicate that there may be a pattern to the missing data
  • the summary statistics indicate a date range from 2009-05-01 through 2010-05-14, 19 missing cash values, and the possibility of outliers (i.e. 10919.8 >> the mean of 155.6).

Exploratory Data Analysis (EDA)

As a natural next step, we explore and deal with missing data. We first visualize, at a high level, where our missing data is via vis_miss(), then identify the specific indices before devising a plan for how to proceed:

#high level visualization of missing data
vis_miss(atm_data)

#specific indices of missing data
which(is.na(atm_data$cash))
##  [1]  173  185  194  209  218 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470
## [16] 1471 1472 1473 1474
#visualize entries 173, 185, 194, 209, 218
#atm_data[173,] #atm1
#atm_data[185,] #atm1
#atm_data[194,] #atm2
#atm_data[209,] #atm1
#atm_data[218,] #atm2

#drop NA values
atm_data <- atm_data %>% drop_na()

From the above visualization we see that we have a very low incidence of missing data: 14 entries are situated at the tail end of our data (all May 2010 observations), with 5 entries situated in the 173 - 218 observation range.

Of the entries situated in the 173 - 218 observation range, 3 are for atm1 and 2 are for atm2. We’ll later replace these missing values via the tsclean() function.

We spread() our data to put it into a more digestible (wide) form, with atm as the key and cash as the value. The column names become the ATM numbers, with the corresponding observations being the amount of cash withdrawn that day.

We output the resulting series for each ATM:

#put the data into wide form to be plotted (ie. one column per ATM)
spread_atm_data <- atm_data %>% spread(atm, cash)
#spread_atm_data #verify

#time series plot
atm_data %>% ggplot(aes(x = date, y = cash, col = atm)) +
    geom_line(show.legend = FALSE) +
    facet_wrap(~ atm, ncol = 1, scales = "free_y") +
    labs(title = "Hundreds of Dollars of Cash Withdrawn per ATM per Day", x = "Date") +
    scale_y_continuous("Cash Withdrawn ($100's)")

There are a number of interesting observations from the plots above:

  • atm1 and atm2 follow a relatively predictable range of withdrawal, where it appears that the day of the week plays an important role in the amount withdrawn
  • atm3 is primarily 0 values. It’s as if the ATM just came online, with a small number of cash withdrawals in the last 3 days of data we’ve been provided. We could exclude this data or cast a simple mean() forecast due to the low number of entries.
  • atm4’s distribution may be similar to that of atm1 and atm2, but it’s hard to tell because the “ordinary values” are drowned out by the extreme outlier. We’ll need to handle this outlier to provide an accurate forecast for the 4th ATM.

From time series plots we proceed to corresponding histograms:

#histogram
inspectdf::inspect_num(spread_atm_data) %>% 
  show_plot()

From the histograms we observe that:

  • atm1 has a normal distribution with a 2nd peak near 0.
  • atm2 has a relatively normal distribution with its most prominent (bimodal) peak at 0.
  • atm3 has nearly all 0 entries with 3 “outlier” events of cash being withdrawn.
  • atm4 has a non-normal (unimodal) distribution with a great proportion of values at / near 0 and an extreme outlier event at 10920. This may be because of the scale / magnitude of the outlier.

We explore the outliers further via boxplot:

#boxplot
par(mfrow = c(1,4))

for(i in 2:5){ #iterate over atm columns
    boxplot(spread_atm_data[i], 
            main = sprintf("%s", names(spread_atm_data)[i]), 
            col = "lightblue")
}

From the above boxplots we see that atm1 has numerous outliers, atm2 has none, atm3 has 3 (one for each non-zero observation), and atm4 has 3 outliers, with one of them being rather extreme. Outlier handling and/or transformation will prove important for forecasting.

And this completes our initial EDA. Next, we proceed to the preparation of our data.

Data Preparation, Model Building & Selection

What follows is a section where we handle outliers and transform / normalize our data (where applicable), build out a couple of models, and then select the stronger performer based on AIC (Akaike information criterion). We do so for each of our ATMs, with a simple mean / average model used for ATM 3 as a result of its limited data.

We start by excluding ATM 3 from this portion since later we’ll utilize a simple mean forecast:

simpler_spread <- dplyr::select(spread_atm_data, -c(atm3))
#simpler_spread

Earlier, our box plots highlighted the presence of an extreme outlier in atm4 and our summary statistics identified missing values for atm1 and atm2. Missing values and outliers are both problematic for model building, so we handle both in one fell swoop via the tsclean() function and verify with summary statistics:

#interpolate atm1 and atm2's missing values while replacing atm4's extreme outlier
simpler_spread$atm1 <- tsclean(simpler_spread$atm1, replace.missing = TRUE)
simpler_spread$atm2 <- tsclean(simpler_spread$atm2, replace.missing = TRUE)
simpler_spread$atm4 <- tsclean(simpler_spread$atm4, replace.missing = TRUE)

summary(simpler_spread) #verify
##       date                 atm1             atm2             atm4         
##  Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   :   1.563  
##  1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 26.00   1st Qu.: 124.334  
##  Median :2009-10-30   Median : 91.00   Median : 67.00   Median : 402.770  
##  Mean   :2009-10-30   Mean   : 84.15   Mean   : 62.59   Mean   : 444.757  
##  3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 704.192  
##  Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :1712.075

We observe no more NA values and a drastically reduced maximum value for atm4, so it appears our use of tsclean() was successful. For more background on how the function works, we can consult its RDocumentation page.
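
As a rough, hedged sketch of what tsclean() relies on under the hood: forecast’s tsoutliers() flags unusual points and proposes replacements, while na.interp() fills in missing values. Running tsoutliers() on the raw atm4 column (a sketch only, output not shown) would likely flag the extreme withdrawal:

#hedged sketch: peek at the outlier-detection step that tsclean() builds on
out <- forecast::tsoutliers(spread_atm_data$atm4) #flag unusual points, suggest replacements
out$index         #positions flagged as outliers (likely includes the ~10920 spike)
out$replacements  #suggested replacement values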

As a next step, we explore our time series data to determine proper models / handling per individual ATM.

We start by decomposing the ATM1 series:

#convert to tsibble object
simpler_spread %>% 
    as_tsibble(index = date) -> ss_ts

#decompose series
ss_ts %>% model(classical_decomposition(atm1, type = "add")) %>%
    components() %>%
    autoplot() +
    labs(title = "Classical additive decomposition for ATM1 cash withdrawal")

There’s no clear trend, the random component looks like white noise (until the end), and it appears that we’re dealing with a weekly seasonality.

Next, we explore the corresponding time series, ACF, and PACF:

#plot time series / ACF / PACF
ss_ts %>%
    gg_tsdisplay(atm1, plot_type = 'partial') +
    labs(title = "Untransformed time series, ACF, and PACF for ATM1 cash withdrawal")

A stationary series has no predictable patterns in the long term: no trend and no seasonality. From the plots it appears that we observe weekly seasonality, confirmed by strong autocorrelation at lags 7, 14, and 21 and by decay in the partial autocorrelation values at the seasonal lags.

While the case for a Box-Cox transformation (to normalize and stabilize variance) is weak, the case for seasonal differencing (D=1) is strong. We explore whether non-seasonal differencing is needed as well:

ndiffs(ss_ts$atm1)
## [1] 0
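
As a complementary, hedged check of the seasonal differencing order, we can give the series an explicit weekly frequency and ask nsdiffs() directly (output not shown):

#hedged sketch: check the seasonal differencing order with an explicit weekly frequency
nsdiffs(ts(ss_ts$atm1, frequency = 7))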

It appears that ATM1 requires neither a Box-Cox transformation nor non-seasonal differencing. I anticipated an ARIMA(x,0,x)(x,1,x) model, where p, q, P, and Q are to be determined. To do so with the greatest accuracy and best use of time, we fit an automatic ARIMA model and compare it against an ETS model:

ss_ts$atm1 %>% ets()
## ETS(M,N,N) 
## 
## Call:
##  ets(y = .) 
## 
##   Smoothing parameters:
##     alpha = 0.0326 
## 
##   Initial states:
##     l = 79.7942 
## 
##   sigma:  0.4352
## 
##      AIC     AICc      BIC 
## 4785.562 4785.628 4797.262
ss_ts %>% model(ARIMA(atm1)) -> fit1

report(fit1)
## Series: atm1 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.1950  -0.5806  -0.1037
## s.e.  0.0546   0.0506   0.0494
## 
## sigma^2 estimated as 556.4:  log likelihood=-1640.01
## AIC=3288.03   AICc=3288.14   BIC=3303.55

Based on the AIC values above, auto ARIMA selected an ARIMA(0,0,1)(0,1,2)[7] model with an AIC of 3288, well below the ETS model’s 4786 (roughly a third lower). Since a lower AIC value is indicative of a stronger model, we elect the ARIMA model to forecast ATM1.
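
Before committing to the model, a quick residual check is worthwhile. The hedged sketch below (using feasts / fabletools helpers; output not shown) looks for leftover autocorrelation in the ARIMA residuals:

#hedged sketch: residual diagnostics for the selected ARIMA model
fit1 %>% gg_tsresiduals() #residual time plot, ACF, and histogram

augment(fit1) %>%
    features(.innov, ljung_box, lag = 14, dof = 3) #dof = 3 estimated MA/SMA coefficients

White-noise residuals (i.e. a large Ljung-Box p-value) would further support the model choice.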

We hold our forecasts until the conclusion of Part A and proceed with the decomposition of ATM2:

#decompose series
ss_ts %>% model(classical_decomposition(atm2, type = "add")) %>%
    components() %>%
    autoplot() +
    labs(title = "Classical additive decomposition for ATM2 cash withdrawal")

For ATM2 we observe a slight downward trend, a random component that also looks like white noise (until the end), and what appears to be a weekly seasonality.

To confirm or deny, we consult the time series, ACF, and PACF:

#plot time series / ACF / PACF
ss_ts %>%
    gg_tsdisplay(atm2, plot_type = 'partial')

From the plots it appears that we again observe weekly seasonality, confirmed by strong autocorrelation at lags 7, 14, and 21 and by decay in the partial autocorrelation values at the seasonal lags. Additionally, we observe spikes at lags 2 and 5 that appear to repeat in the ACF and PACF plots.

While the case for a Box-Cox transformation (to normalize and stabilize variance) is weak, the case for seasonal differencing (D=1) is strong. We explore whether non-seasonal differencing is needed as well:

ndiffs(ss_ts$atm2)
## [1] 1

It appears that ATM2 does not require a Box-Cox transformation but does require non-seasonal differencing (d=1). I anticipated an ARIMA(x,1,x)(x,1,x) model, where p, q, P, and Q are to be determined. To do so with the greatest accuracy and best use of time, we fit an automatic ARIMA model and compare it against an ETS model:

ss_ts$atm2 %>% ets()
## ETS(A,A,N) 
## 
## Call:
##  ets(y = .) 
## 
##   Smoothing parameters:
##     alpha = 7e-04 
##     beta  = 7e-04 
## 
##   Initial states:
##     l = 78.8479 
##     b = -0.0342 
## 
##   sigma:  38.286
## 
##      AIC     AICc      BIC 
## 4820.352 4820.519 4839.851
ss_ts %>% model(ARIMA(atm2)) -> fit2

report(fit2)
## Series: atm2 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4320  -0.9130  0.4773  0.8048  -0.7547
## s.e.   0.0553   0.0407  0.0861  0.0556   0.0381
## 
## sigma^2 estimated as 602.5:  log likelihood=-1653.67
## AIC=3319.33   AICc=3319.57   BIC=3342.61

Auto ARIMA led to an ARIMA(2,0,2)(0,1,1)[7] model with an AIC of 3319, well below the ETS model’s 4820 (roughly a third lower). Since a lower AIC value is indicative of a stronger model, we elect the ARIMA model to forecast ATM2.
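
For reference, the same comparison can be assembled in a single fable call; a hedged sketch using fable’s ETS() and glance() (output not shown) follows:

#hedged sketch: fit ETS and ARIMA side by side and pull information criteria from one table
ss_ts %>%
    model(ets = ETS(atm2), arima = ARIMA(atm2)) %>%
    glance() %>%
    dplyr::select(.model, AIC, AICc, BIC)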

Later, at the conclusion of Part A, we cast our forecasts, but first we proceed to ATM4’s decomposition:

#decompose series
ss_ts %>% model(classical_decomposition(atm4, type = "add")) %>%
    components() %>%
    autoplot() +
    labs(title = "Classical additive decomposition for ATM4 cash withdrawal")

There’s no apparent trend to the data, the random component looks completely like white noise, and there’s also a clear weekly seasonality to the data.

We follow up by consulting the corresponding time series, ACF, PACF, and number of non-seasonal differences:

#plot time series / ACF / PACF
ss_ts %>%
    gg_tsdisplay(atm4, plot_type = 'partial')

ndiffs(ss_ts$atm4)
## [1] 0

ATM 4’s output appears to be quite different from that of ATM 1 and 2: the magnitude of its spikes is greater, as is its mean. While there’s no need for non-seasonal differencing, I believe some of the variability in the data may be reined in via a Box-Cox transformation. Additionally, there appears to be a case for seasonal differencing based on the ACF and PACF plots.
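
If we did want to fold a Box-Cox transformation into the ARIMA search, a hedged sketch using the guerrero feature to estimate lambda could look like the following (lambda_guess is a name introduced here only for illustration; output not shown):

#hedged sketch: estimate a Box-Cox lambda for atm4 and model the transformed series
lambda_guess <- ss_ts %>%
    features(atm4, features = guerrero) %>%
    dplyr::pull(lambda_guerrero)

ss_ts %>% model(ARIMA(box_cox(atm4, lambda_guess))) #fable back-transforms forecasts automatically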

I anticipated an ARIMA(x,0,x)(x,1,x) model, where p, q, P, and Q are to be determined. To do so with the greatest accuracy and best use of time, we fit an automatic ARIMA model and compare it against an ETS model:

#note: nesting BoxCox.lambda() like this estimates a lambda from a single number rather than from the series, so the fit below ends up with lambda = 1 (effectively no transformation)
ss_ts$atm4 %>% ets(lambda = BoxCox.lambda(BoxCox.lambda(ss_ts$atm4)))
## ETS(A,N,N) 
## 
## Call:
##  ets(y = ., lambda = BoxCox.lambda(BoxCox.lambda(ss_ts$atm4))) 
## 
##   Box-Cox transformation: lambda= 1 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 443.8374 
## 
##   sigma:  351.5847
## 
##      AIC     AICc      BIC 
## 6437.046 6437.112 6448.746
ss_ts %>% model(ARIMA(atm4)) -> fit4

report(fit4)
## Series: atm4 
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean 
## 
## Coefficients:
##           ar1      ar2     ar3     ma1     ma2    sar1  constant
##       -0.8778  -0.6769  0.1700  0.9634  0.8065  0.2484  795.0206
## s.e.   0.1004   0.1186  0.0564  0.0906  0.1079  0.0556   48.8937
## 
## sigma^2 estimated as 117528:  log likelihood=-2645.21
## AIC=5306.42   AICc=5306.82   BIC=5337.62

Auto ARIMA led to an ARIMA(3,0,2)(1,0,0)[7] model with an AIC of 5306. While this AIC is much higher than those for ATM1 and ATM2, it is still significantly lower than that of the ETS model, so we elect the ARIMA model to forecast ATM4. Note that, because of the nested BoxCox.lambda() call, the ETS model was effectively fit with lambda = 1, so the Box-Cox transformation was not actually applied.
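
For reference, a corrected ETS call with a single BoxCox.lambda() would look like the hedged sketch below (output not shown; the fit reported above reflects the lambda = 1 run):

#hedged sketch: ETS with a lambda estimated from the series itself (single BoxCox.lambda call)
lambda_atm4 <- BoxCox.lambda(ss_ts$atm4)
ets(ss_ts$atm4, lambda = lambda_atm4)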

Prior to casting our forecasts, we mustn’t forget ATM 3. Being that we only had 3 non-zero data points to work with, I elected to take the mean of these 3 values as the forecast:

#select ATM3 column
simpler_spread_atm3 <- dplyr::select(spread_atm_data, c(atm3))

#select only non-zero rows
simple_atm3 <- tail(simpler_spread_atm3, n=3)

#take the mean
atm3_mean <- mean(simple_atm3$atm3)
atm3_mean
## [1] 87.66667
#forecast
atm3_forecast <- ts(rep(atm3_mean, 31), start = 1, end = 31) #repeat the mean for each of May's 31 days

The average value for ATM3 was 87.66667, so we forecast this value for the entire month of May. Note: while not plotted below, it is included in our forecast csv file.
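
As an aside, the same mean forecast could be produced with fable’s MEAN() model, which also yields prediction intervals. A hedged sketch (atm3_ts is a name introduced here only for illustration; output not shown):

#hedged sketch: mean forecast for atm3 via fable's MEAN() model
atm3_ts <- atm_data %>%
    dplyr::filter(atm == "atm3", cash > 0) %>% #only the 3 non-zero days
    as_tsibble(index = date)

atm3_ts %>% model(MEAN(cash)) %>% forecast(h = 31)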

Forecast

As a final step in this process, we cast our forecasts for May of 2010 (31 days), output all forecasts in one plot to verify the output, and then send the forecasts to csv as requested for the submission:

#forecast
fit1 %>% forecast(h=31) -> atm1_forecast
fit2 %>% forecast(h=31) -> atm2_forecast
fit4 %>% forecast(h=31) -> atm4_forecast

#output all forecast plots in one
gridExtra::grid.arrange(
  autoplot(atm1_forecast) + 
    labs(title = "ATM1: ARIMA(0,0,1)(0,1,2)", x = NULL, y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm2_forecast) + 
    labs(title = "ATM2: ARIMA(2,0,2)(0,1,1)", x = NULL, y = "100s of Dollars") +
    theme(legend.position = "none"),
  autoplot(atm4_forecast) + 
    labs(title = "ATM4: ARIMA(3,0,2)(1,0,0)", x = "Day", y = NULL) +
    theme(legend.position = "none"),
  top = grid::textGrob("Forecasted ATM withdrawals for May of 2010\n")
)

#send to csv
csv_df <- tibble(date = atm1_forecast$date, atm1 = atm1_forecast$.mean, atm2 = atm2_forecast$.mean, atm3 = atm3_forecast, atm4 = atm4_forecast$.mean)
write.csv(csv_df,"C:/Users/magnu/Documents/DATA-624//DATA624Proj1PtA_ATM_forecasts.csv", row.names = FALSE)

From the above forecasts, it appears that the ATM1 and ATM2 forecasts may be accurate, but ATM4’s forecast does not appear to follow the flow of the data leading up to May 2010, so my confidence in it is minimal. Maybe a model outside the scope of this class would have performed better here …


Part B - Forecasting Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single Excel file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

We read in our data, convert column names to a consistent, lower case form and convert our data into a time series (starting in January of 1998). The resulting first 6 entries (head), last 6 entries (tail), and summary statistics follow:

#read in the Excel file using read_excel()
power_data <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx") %>%
        rename(case_sequence = CaseSequence, month = `YYYY-MMM`, kwh = KWH) #column names --> consistent form
#view(power_data) #verify

#convert to time series
power_ts <- ts(power_data$kwh, start = c(1998,1), frequency = 12)

head(power_ts)
##          Jan     Feb     Mar     Apr     May     Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147
tail(power_ts)
##          Jul     Aug     Sep     Oct     Nov     Dec
## 2013 8415321 9080226 7968220 5759367 5769083 9606304
summary(power_ts)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   770523  5429912  6283324  6502475  7620524 10655730        1

We observe that the series runs from January 1998 through December 2013. We’re dealing with 1 NA value and quite a range of KWH values, and residential power consumption appears to rise over time, although a plot will be most useful in verifying this. We do not appear to have much of a missing data problem, although it will be interesting to locate exactly where the NA value is situated.

Exploratory Data Analysis (EDA)

As a natural next step, we explore where exactly the NA value is situated and impute using the na_interpolation() function:

#specific indices of missing data
ggplot_na_distribution(power_ts)

#impute missing value
#power_ts #verify ts before
power_ts <- na_interpolation(power_ts)

Where there was an NA value for September of 2008 (before), we impute a “reasonable” 6569470 KWH value. I elected to impute rather than drop because dropping the observation would break the regular monthly spacing of the series. Additionally, I elected the aforementioned function because it specializes in univariate time series data (which we’re using) and its documentation was clear / promising.
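
Had we kept a copy of the un-imputed series (call it power_ts_raw, a hypothetical name used only for this sketch), we could also compare interpolation options and visualize the filled point; a hedged sketch follows:

#hedged sketch: compare imputeTS interpolation options on the raw (un-imputed) series
power_ts_linear <- na_interpolation(power_ts_raw, option = "linear") #default, straight-line fill
power_ts_spline <- na_interpolation(power_ts_raw, option = "spline") #cubic spline fill

ggplot_na_imputations(power_ts_raw, power_ts_linear) #highlight the imputed value in context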

We use the ts_plot function as a means of exploring our time series data with a different approach to plotting (plotly) than used earlier:

ts_plot(power_ts,
        title = "Residential Power Consumption",
        Xtitle = "Date",
        Ytitle = "KWHs")

It appears that imputation worked, we have an extreme outlier in 2010, and our data is seasonal and trending up.

Outlier-handling will be important. Where it appears that we should have had a large positive number (~10mil KWH), instead we have our minimum value (~770k KWH).

I lean on the tsclean() function to replace this outlier with a more reasonable number and verify using ts_plot():

#remove and replace outlier
##replace outlier using tsclean()
power_ts_cleaned <- tsclean(power_ts, replace.missing = TRUE)

##verify
ts_plot(power_ts_cleaned,
        title = "Residential Power Consumption - Cleaned",
        Xtitle = "Date",
        Ytitle = "KWHs")

With NA and outlier values handled, we proceed to data preparation and model-building.

Data Preparation & Model Building

To familiarize ourselves with the types of transformations that may be necessary for our data, we first decompose the series:

d_ptc <- decompose(power_ts_cleaned)
plot(d_ptc)

Our data has an upward trend and seasonality. A Box-Cox or log transformation will help stabilize the variance, while the trend and seasonality can be addressed by differencing the data (non-seasonal and seasonal differencing, respectively).

We perform a box cox transformation and consult the resulting plot:

#-0.1460788
bc_ptc <- BoxCox(power_ts_cleaned, lambda=BoxCox.lambda(power_ts_cleaned))

autoplot(bc_ptc)

With a lambda of -0.1460788, it looks like the variance we’d earlier noted in our data has been stabilized.

Being that seasonal differencing (D=1) will be required, we explore whether non-seasonal differencing is needed as well:

ndiffs(bc_ptc)
## [1] 1

It appears that non-seasonal differencing is required.
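
A hedged sketch of applying both differences to the transformed series and inspecting the result (output not shown) follows:

#hedged sketch: seasonal (lag-12) and non-seasonal differences of the Box-Cox series
bc_ptc %>%
    diff(lag = 12) %>% #seasonal difference (D = 1)
    diff() %>%         #non-seasonal difference (d = 1)
    ggtsdisplay()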

To observe the difference between an ETS (exponential smoothing) model and a seasonal ARIMA model (with orders suggested by auto-ARIMA), we train-test split our data and use the MAPE (mean absolute percentage error) on the test set for model selection.

We train-test split our data, build an ETS model, and output the model’s training statistics and test-set MAPE:

#train test split time series - using 2013
split_power <- ts_split(power_ts_cleaned, sample.out = 12)

training <- split_power$train
testing <- split_power$test

#length(training)
#length(testing) #12

#ets model
ets_model <- ets(training, lambda=BoxCox.lambda(power_ts_cleaned))
summary(ets_model)
## ETS(A,N,A) 
## 
## Call:
##  ets(y = training, lambda = BoxCox.lambda(power_ts_cleaned)) 
## 
##   Box-Cox transformation: lambda= -0.1461 
## 
##   Smoothing parameters:
##     alpha = 0.191 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 6.14 
##     s = -0.0059 -0.0281 -0.0125 0.0182 0.0264 0.0192
##            0.0026 -0.0237 -0.0188 -0.0068 0.0073 0.022
## 
##   sigma:  0.0092
## 
##       AIC      AICc       BIC 
## -736.3954 -733.4686 -688.5011 
## 
## Training set error measures:
##                    ME   RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 64913.73 582164 439112.8 0.2910452 6.761764 0.7086627 0.2478782
ets_forecast <- forecast(ets_model, h=length(testing))
#abs(testing - ets_forecast$mean) / testing # verify
MAPE(ets_forecast$mean, testing) / 100
## [1] 0.9742289

The training-set MAPE is 6.76%. The testing-set MAPE, after dividing by 100, comes out to 0.974%. Less than 1% MAPE seems too good to be true and would be indicative of very strong predictive performance.
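
Because the scaling convention of MAPE() helpers differs across packages, a hedged sanity check is to compute the MAPE by hand on the original scale (output not shown):

#hedged sketch: hand-computed MAPE on the test set, expressed as a percentage
mean(abs((testing - ets_forecast$mean) / testing)) * 100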

We proceed to the seasonal ARIMA model for comparison. We use auto.arima() to identify the model orders (p,d,q)(P,D,Q), fit our SARIMA model with sarima.for(), and consult the test-set MAPE:

#auto.arima(training)

sarima_forecast <- sarima.for(training, n.ahead=length(testing), p=1,d=0,q=1,P=2,D=1,Q=0,S=12)

MAPE(sarima_forecast$pred, testing) / 100
## [1] 0.9962629

For the SARIMA model we get a testing MAPE of 0.996%. While on the surface this (and the plot) looks quite good, it is slightly worse than our exponential smoothing model.

Model Selection & Forecast

I choose the exponential smoothing model to forecast with, by this most incremental of margins. The exponential smoothing model had an ever-so-slightly smaller MAPE, which indicates lower error on the held-out data. While the margin between models was minuscule, I stand by the principle that evaluating against a hold-out set provides a strong vote of confidence in the performance of our elected model.
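
As an alternative to a single hold-out year, time-series cross-validation averages errors over many rolling origins. A hedged sketch using forecast::tsCV() (ets_cv_fun is a name introduced here only for illustration; output not shown, and the repeated refits are slow):

#hedged sketch: rolling-origin evaluation of the ETS specification
ets_cv_fun <- function(y, h) forecast(ets(y, lambda = BoxCox.lambda(power_ts_cleaned)), h = h)
cv_errors <- tsCV(power_ts_cleaned, ets_cv_fun, h = 12)
sqrt(mean(cv_errors^2, na.rm = TRUE)) #RMSE averaged over horizons 1 through 12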

As a final step, we train our model on all the data, forecast for 2014, output the corresponding plot, and send the forecast to csv:

#retrain the model (accounting for 2013 data this time)
ets_fit <- ets(power_ts_cleaned, lambda=BoxCox.lambda(power_ts_cleaned))

#forecast
ets_fit %>% forecast(h=12) -> power_forecast

#output forecast plot
autoplot(power_forecast)

power_forecast #list
##          Point Forecast   Lo 80    Hi 80   Lo 95    Hi 95
## Jan 2014        9411722 8324368 10664993 7807444 11405313
## Feb 2014        8089025 7168809  9147181 6730575  9771108
## Mar 2014        6689187 5943870  7543594 5588107  8046177
## Apr 2014        5912700 5261607  6657815 4950422  7095526
## May 2014        5634103 5014662  6342836 4718554  6759102
## Jun 2014        7427261 6575098  8408420 6169664  8987525
## Jul 2014        8696137 7671753  9880372 7185858 10581547
## Aug 2014        9236485 8134187 10513405 7612135 11270666
## Sep 2014        8193854 7227144  9311687 6768687  9973670
## Oct 2014        5981183 5301887  6762111 4978325  7222484
## Nov 2014        5625277 4988731  6356661 4685409  6787644
## Dec 2014        7070354 6213958  8064771 5809085  8655587
#generate date sequence for csv
power_forecast.dates <- yearmonth(seq(as.Date("2014/1/1"), by = "month", length.out = 12))

#generate df for csv
csv_df2 <- tibble(date = power_forecast.dates, kwh = power_forecast$mean)

#write to csv
write.csv(csv_df2,"C:/Users/magnu/Documents/DATA-624//DATA624Proj1PtB_power_forecast.csv", row.names = FALSE)

The plot above highlights an “in range” forecast. It appears that an exponential smoothing model’s greater weight on recent data works in this instance of power consumption.

While there is greater variability in the data prior to our forecast, our forecast captures the seasonality of what appears to be higher demand in the summer and winter while sitting between the peaks and troughs of more recent years.