library(fpp2) #in case of fpp3 package issues
library(fpp3)
library(forecast)
library(ggplot2)
library(naniar) #visualize missing data
library(inspectdf) #high level histogram
library(skimr) #high level data statistics
library(car) #cook's distance (outlier handling)
library(imputeTS) #impute missing time series values
library(TSstudio) #plot time series
library(astsa) #(S)ARIMA
The purpose of our first project is to explore Time Series, Decomposition, Forecasting, Data Preprocessing / Overfitting, Exponential smoothing, and ARIMA, and to see if / where we might apply the concepts we've covered up to this point, given different data sets and asks.
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is provided in a single Excel file. The variable ‘Cash’ is provided in hundreds of dollars. Other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling.
To start, we read in our data, specifying column type, bringing column names into a consistent (lowercase) form, converting the date
column to be a date object, ordering our observations based on date and atm #, and finally converting all atm entries to lowercase.
The resulting first 6 entries (head), last 6 entries (tail), and summary statistics follow:
#read in the Excel file using read_excel()
<- readxl::read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric")) %>%
atm_data rename(date=DATE, atm=ATM, cash=Cash) %>% #column names --> consistent form
mutate(date = as_date(date)) %>% #convert object to date
arrange(date, atm) #sort based on date, atm #
$atm <- tolower(atm_data$atm) #convert atm entries to lower case
atm_data
head(atm_data)
## # A tibble: 6 x 3
## date atm cash
## <date> <chr> <dbl>
## 1 2009-05-01 atm1 96
## 2 2009-05-01 atm2 107
## 3 2009-05-01 atm3 0
## 4 2009-05-01 atm4 777.
## 5 2009-05-02 atm1 82
## 6 2009-05-02 atm2 89
tail(atm_data)
## # A tibble: 6 x 3
## date atm cash
## <date> <chr> <dbl>
## 1 2010-05-09 <NA> NA
## 2 2010-05-10 <NA> NA
## 3 2010-05-11 <NA> NA
## 4 2010-05-12 <NA> NA
## 5 2010-05-13 <NA> NA
## 6 2010-05-14 <NA> NA
summary(atm_data)
## date atm cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
From the above output we note that we're dealing with 19 missing cash values and the possibility of outliers (i.e. 10919.8 >> 155.6). As a natural next step, we explore and deal with the missing data. We visualize at a high level where our missing data is via vis_miss() and then check the specific indices before devising a plan for how to proceed:
#high level visualization of missing data
vis_miss(atm_data)
#specific indices of missing data
which(is.na(atm_data$cash))
## [1] 173 185 194 209 218 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470
## [16] 1471 1472 1473 1474
#visualize entries 173, 185, 194, 209, 218
#atm_data[173,] #atm1
#atm_data[185,] #atm1
#atm_data[194,] #atm2
#atm_data[209,] #atm1
#atm_data[218,] #atm2
#drop NA values
atm_data <- atm_data %>% drop_na()
From the above visualization we see that we have a very low incidence of missing data: 14 entries are situated at the tail end of our data (all May 2010 observations) and 5 entries are situated in the 173 - 218 observation range. Of the entries in the 173 - 218 range, 3 are for atm1 and 2 are for atm2. We'll later replace these missing values via the tsclean() function.
We spread() our data to put it into a more digestible form, with atm as the key and cash as the value. The column names become the atm # with corresponding observations as the amount of cash withdrawn that day.
We output the resulting series for each ATM:
#put it into a form to be plot (ie. based on ATM #)
spread_atm_data <- atm_data %>% spread(atm, cash)
#spread_atm_data #verify

#time series plot
atm_data %>% ggplot(aes(x = date, y = cash, col = atm)) +
  geom_line(show.legend = FALSE) +
facet_wrap(~ atm, ncol = 1, scales = "free_y") +
labs(title = "Hundreds of Dollars of Cash Withdrawn per ATM per Day", x = "Date") +
scale_y_continuous("Cash Withdrawn ($100's)")
There are a number of interesting observations from the plots above:

- atm1 and atm2 follow a relatively predictable range of withdrawal where it appears that day of the week plays an important role in the amount withdrawn.
- atm3 is primarily 0 values. It's as if the ATM just came online, with a small number of cash withdrawals in the last 3 days of data we've been provided. We could exclude this data or cast a simple mean() forecast due to the low number of entries.
- atm4's distribution may be similar to that of atm1 and atm2, but it's hard to tell because the "ordinary values" are drowned out by the extreme outlier. We'll need to handle this outlier to provide an accurate forecast for the 4th ATM.

From the time series plots we proceed to the corresponding histograms:
#histogram
inspectdf::inspect_num(spread_atm_data) %>%
  show_plot()
From the histograms we note:

- atm1 has a normal distribution with a 2nd peak near 0.
- atm2 has a relatively normal distribution with its most prominent (bimodal) peak at 0.
- atm3 has nearly all 0 entries with 3 "outlier" events of cash being withdrawn.
- atm4 has a non-normal (unimodal) distribution with a great proportion at / near 0 and an extreme outlier event at 10920. This may be because of the scale / magnitude of the outlier.

We explore the outliers further via boxplot:
#boxplot
par(mfrow = c(1,4))
for(i in 2:5){ #iterate over atm columns
boxplot(spread_atm_data[i],
main = sprintf("%s", names(spread_atm_data)[i]),
col = "lightblue")
}
From the above boxplots we see that atm1 has numerous outliers, atm2 has none, atm3 has 3 (one for each non-zero observation), and atm4 has 3 outliers, one of them rather extreme. Outlier-handling and/or transformation will prove important for forecasting.
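As a quick, hedged check on that reading, we can list the exact values the standard 1.5×IQR boxplot rule flags for each ATM. A minimal sketch using base R's boxplot.stats() (the same rule boxplot() draws):

#sketch: values flagged as outliers by the 1.5*IQR rule, per ATM
lapply(spread_atm_data[, c("atm1", "atm2", "atm3", "atm4")],
       function(x) boxplot.stats(x)$out)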
And this completes our initial EDA. Next, we proceed to the preparation of our data.
What follows is a section where we handle outliers and transform / normalize our data (where applicable), build out a couple of models, and then select the stronger-performing model based on AIC (Akaike information criterion). We do so for each of our ATMs, where a simple mean / average model will be used for ATM 3 as a result of its limited data.
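For reference, the criterion we compare throughout is the Akaike information criterion,

$$\text{AIC} = 2k - 2\ln(\hat{L}),$$

where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood; lower values indicate a better balance of fit and complexity.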
We start by excluding ATM 3 from this portion since later we’ll utilize a simple mean forecast:
simpler_spread <- dplyr::select(spread_atm_data, -c(atm3))
#simpler_spread
Earlier, our box plots highlighted the presence of an extreme outlier in atm4 and our summary statistics identified missing values for atm1 and atm2. Missing values and outliers are both problematic for model building, and so we handle both in one fell swoop via the tsclean() function, verifying via summary statistics:
#interpolate atm1 and atm2's missing values while replacing atm4's extreme outlier
simpler_spread$atm1 <- tsclean(simpler_spread$atm1, replace.missing = TRUE)
simpler_spread$atm2 <- tsclean(simpler_spread$atm2, replace.missing = TRUE)
simpler_spread$atm4 <- tsclean(simpler_spread$atm4, replace.missing = TRUE)
summary(simpler_spread) #verify
## date atm1 atm2 atm4
## Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 1.563
## 1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 26.00 1st Qu.: 124.334
## Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 402.770
## Mean :2009-10-30 Mean : 84.15 Mean : 62.59 Mean : 444.757
## 3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 704.192
## Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :1712.075
We observe no more NA values and a drastically reduced maximum value for atm4, so it appears our use of tsclean() was successful. For more background on how the function works we can consult the R documentation.
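As a hedged aside, per the forecast documentation tsclean() essentially pairs na.interp() (for missing values) with tsoutliers() (for outlier replacement), so we could preview what it would change on the raw, pre-cleaning atm4 column; a sketch:

#sketch: preview the points tsoutliers() flags in the raw atm4 series
atm4_flagged <- forecast::tsoutliers(spread_atm_data$atm4)
atm4_flagged$index        #positions of the flagged observations
atm4_flagged$replacements #suggested replacement values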
As a next step, we explore our time series data to determine proper models / handling per individual ATM.
We start by decomposing the ATM1 series:
#convert to tsibble object
simpler_spread %>%
  as_tsibble(index = date) -> ss_ts

#decompose series
ss_ts %>% model(classical_decomposition(atm1, type = "add")) %>%
  components() %>%
autoplot() +
labs(title = "Classical additive decomposition for ATM1 cash withdrawal")
There’s no clear trend, the random component looks like white noise (until the end), and it appears that we’re dealing with a weekly seasonality.
Next, we explore the corresponding time series, ACF, and PACF:
#plot time series / ACF / PCF
ss_ts %>%
  gg_tsdisplay(atm1, plot_type = 'partial') +
labs(title = "Untransformed time series, ACF, and PACF for ATM1 cash withdrawal")
A stationary series has no predictable patterns in the long term: no trend and no seasonality. From the plots it appears that we observe seasonality (every 7, 14, 21 days), confirmed by strong autocorrelation and exponential decay in the partial autocorrelation values (every 7 days).
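As a hedged cross-check of that visual read, we can treat atm1 as a frequency-7 (weekly) series, an assumption on my part, and ask nsdiffs() how many seasonal differences it suggests:

#sketch: seasonal differences suggested when atm1 is treated as weekly (frequency = 7)
nsdiffs(ts(ss_ts$atm1, frequency = 7))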
While the case for a Box-Cox transformation (to normalize and stabilize variance) is weak, the case for seasonal differencing (D=1) is strong. We explore whether non-seasonal differencing is also required:
ndiffs(ss_ts$atm1)
## [1] 0
It appears that ATM1 requires neither a Box-Cox transformation nor non-seasonal differencing. I anticipated an ARIMA(p,0,q)(P,1,Q) model where p, q, P, and Q are to be determined. To determine them with the greatest accuracy and best use of time, we fit an automatic ARIMA model and compare it against an ETS model:
ss_ts$atm1 %>% ets()
## ETS(M,N,N)
##
## Call:
## ets(y = .)
##
## Smoothing parameters:
## alpha = 0.0326
##
## Initial states:
## l = 79.7942
##
## sigma: 0.4352
##
## AIC AICc BIC
## 4785.562 4785.628 4797.262
ss_ts %>% model(ARIMA(atm1)) -> fit1
report(fit1)
## Series: atm1
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1950 -0.5806 -0.1037
## s.e. 0.0546 0.0506 0.0494
##
## sigma^2 estimated as 556.4: log likelihood=-1640.01
## AIC=3288.03 AICc=3288.14 BIC=3303.55
Based on the AICs output above, auto ARIMA led to an ARIMA(0,0,1)(0,1,2)[7] model with an AIC of 3288, well below the ETS model's 4786. Being that a lower AIC value is indicative of a stronger model, we elect the ARIMA model to forecast for ATM1.
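As a quick, optional sanity check on this choice, gg_tsresiduals() (from feasts, loaded with fpp3) plots the innovation residuals, their ACF, and a histogram; roughly white-noise residuals would support the selected model. A minimal sketch:

#sketch: residual diagnostics for the chosen ATM1 model
fit1 %>% gg_tsresiduals()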
We cast our forecasts toward the conclusion of Part A and proceed with the decomposition of ATM2:
#decompose series
ss_ts %>% model(classical_decomposition(atm2, type = "add")) %>%
  components() %>%
autoplot() +
labs(title = "Classical additive decomposition for ATM2 cash withdrawal")
For ATM2 we observe a slight downward trend, a random component that also looks like white noise (until the end), and what appears to be a weekly seasonality.
To confirm or deny, we consult the time series, ACF, and PACF:
#plot time series / ACF / PCF
ss_ts %>%
  gg_tsdisplay(atm2, plot_type = 'partial')
From the plots it appears that we observe seasonality (every 7, 14, 21 days), confirmed by strong autocorrelation and exponential decay in the partial autocorrelation values (every 7 days). Additionally, we observe spikes at days 2 and 5 that appear to repeat in the ACF and PACF plots.
While the case for a Box-Cox transformation (to normalize and stabilize variance) is weak, the case for seasonal differencing (D=1) is strong. We explore whether non-seasonal differencing is also required:
ndiffs(ss_ts$atm2)
## [1] 1
It appears that ATM2 does not require a Box-Cox transformation but does require non-seasonal differencing (d=1). I anticipated an ARIMA(p,1,q)(P,1,Q) model where p, q, P, and Q are to be determined. To determine them with the greatest accuracy and best use of time, we fit an automatic ARIMA model and compare it against an ETS model:
ss_ts$atm2 %>% ets()
## ETS(A,A,N)
##
## Call:
## ets(y = .)
##
## Smoothing parameters:
## alpha = 7e-04
## beta = 7e-04
##
## Initial states:
## l = 78.8479
## b = -0.0342
##
## sigma: 38.286
##
## AIC AICc BIC
## 4820.352 4820.519 4839.851
ss_ts %>% model(ARIMA(atm2)) -> fit2
report(fit2)
## Series: atm2
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4320 -0.9130 0.4773 0.8048 -0.7547
## s.e. 0.0553 0.0407 0.0861 0.0556 0.0381
##
## sigma^2 estimated as 602.5: log likelihood=-1653.67
## AIC=3319.33 AICc=3319.57 BIC=3342.61
Auto ARIMA led to an ARIMA(2,0,2)(0,1,1)[7] model with an AIC of 3319, well below the ETS model's 4820. Being that a lower AIC value is indicative of a stronger model, we elect the ARIMA model to forecast for ATM2.
Later, at the conclusion of Part A, we cast our forecasts, but first we proceed to ATM4's decomposition:
#decompose series
ss_ts %>% model(classical_decomposition(atm4, type = "add")) %>%
  components() %>%
autoplot() +
labs(title = "Classical additive decomposition for ATM4 cash withdrawal")
There’s no apparent trend to the data, the random component looks completely like white noise, and there’s also a clear weekly seasonality to the data.
We follow up by consulting the corresponding time series, ACF, PACF, and number of non-seasonal differences:
#plot time series / ACF / PCF
ss_ts %>%
  gg_tsdisplay(atm4, plot_type = 'partial')
ndiffs(ss_ts$atm4)
## [1] 0
ATM 4's output appears to be quite different from that of ATMs 1 and 2: the magnitude of its spikes is greater, as is its mean. While there's no need for non-seasonal differencing, I believe some of the variability in the data may be reined in via a Box-Cox transformation. Additionally, there appears to be a case for seasonal differencing based on the ACF and PACF plots.
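To gauge how strong the Box-Cox case actually is, a small sketch: BoxCox.lambda() estimates the transformation parameter directly from the series (a value near 1 would suggest little transformation is needed).

#sketch: estimate the Box-Cox lambda for atm4 directly
BoxCox.lambda(ss_ts$atm4)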
I anticipated an ARIMA(p,0,q)(P,1,Q) model where p, q, P, and Q are to be determined. To determine them with the greatest accuracy and best use of time, we fit an automatic ARIMA model and compare it against an ETS model:
ss_ts$atm4 %>% ets(lambda = BoxCox.lambda(BoxCox.lambda(ss_ts$atm4)))
## ETS(A,N,N)
##
## Call:
## ets(y = ., lambda = BoxCox.lambda(BoxCox.lambda(ss_ts$atm4)))
##
## Box-Cox transformation: lambda= 1
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 443.8374
##
## sigma: 351.5847
##
## AIC AICc BIC
## 6437.046 6437.112 6448.746
#ss_ts %>% model(ARIMA(atm4)) -> fit4
ss_ts %>% model(ARIMA(atm4)) -> fit4
report(fit4)
## Series: atm4
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 constant
## -0.8778 -0.6769 0.1700 0.9634 0.8065 0.2484 795.0206
## s.e. 0.1004 0.1186 0.0564 0.0906 0.1079 0.0556 48.8937
##
## sigma^2 estimated as 117528: log likelihood=-2645.21
## AIC=5306.42 AICc=5306.82 BIC=5337.62
Auto ARIMA led to an ARIMA(3,0,2)(1,0,0)[7] model with an AIC of 5306. While this AIC is much higher than those of the ATM1 and ATM2 models, it is still significantly lower than the corresponding ETS model's, and so we elect the ARIMA model to forecast for ATM4. It appears that the Box-Cox transformation did not improve the ETS fit.
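Given the weaker fit here, a hedged portmanteau check on the residuals could flag any remaining autocorrelation; a sketch using the fabletools helpers, where lag = 14 (two weeks) and dof = 6 (the six ARMA coefficients reported above) are my own choices:

#sketch: Ljung-Box test on the ATM4 model's innovation residuals
augment(fit4) %>% features(.innov, ljung_box, lag = 14, dof = 6)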
Prior to casting our forecasts, we mustn’t forget ATM 3. Being that we only had 3 data points to work with, I elected to take the mean of these 3 values to provide a forecast:
#select ATM3 column
simpler_spread_atm3 <- dplyr::select(spread_atm_data, c(atm3))

#select only non-zero rows
simple_atm3 <- tail(simpler_spread_atm3, n=3)

#take the mean
atm3_mean <- mean(simple_atm3$atm3)
atm3_mean
## [1] 87.66667
#forecast
atm3_forecast <- ts(rep(atm3_mean), start = 1, end = 31)
The average value for ATM3 was 87.66667 and so we forecast this value for the entire month of May. Note: while not plotted below, it is output in our forecast csv file.
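An equivalent, slightly more formal route would be the forecast package's mean method, which adds prediction intervals to the same flat forecast; a sketch, assuming we are comfortable treating the 3 non-zero observations as the entire history:

#sketch: the same flat ATM3 forecast via meanf(), with prediction intervals
meanf(simple_atm3$atm3, h = 31)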
As a final step in this process, we cast our forecast for May of 2010 (31 days), output all forecasts in one plot to verify the output, and then send forecasts to csv as requested of the submittal:
#forecast
fit1 %>% forecast(h=31) -> atm1_forecast
fit2 %>% forecast(h=31) -> atm2_forecast
fit4 %>% forecast(h=31) -> atm4_forecast
#output all forecast plots in one
gridExtra::grid.arrange(
  autoplot(atm1_forecast) +
labs(title = "ATM1: ARIMA(0,0,1)(0,1,2)", x = NULL, y = NULL) +
theme(legend.position = "none"),
autoplot(atm2_forecast) +
labs(title = "ATM2: ARIMA(2,0,2)(0,1,1)", x = NULL, y = "100s of Dollars") +
theme(legend.position = "none"),
autoplot(atm4_forecast) +
labs(title = "ATM4: ARIMA(3,0,2)(1,0,0)", x = "Day", y = NULL) +
theme(legend.position = "none"),
top = grid::textGrob("Forecasted ATM withdrawals for May of 2010\n")
)
#send to csv
csv_df <- tibble(date = atm1_forecast$date, atm1 = atm1_forecast$.mean, atm2 = atm2_forecast$.mean, atm3 = atm3_forecast, atm4 = atm4_forecast$.mean)
write.csv(csv_df, "C:/Users/magnu/Documents/DATA-624//DATA624Proj1PtA_ATM_forecasts.csv", row.names = FALSE)
From the above forecasts, it appears that ATM1 and ATM2 forecasts may be accurate but ATM4 does not appear to follow the flow of the data up until May of 2010 and so my confidence in this forecast is minimal. Maybe a model outside the scope of this class would have performed better here …
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single Excel file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
We read in our data, convert column names to a consistent, lower case form and convert our data into a time series (starting in January of 1998). The resulting first 6 entries (head), last 6 entries (tail), and summary statistics follow:
#read in the Excel file using read_excel()
<- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx") %>%
power_data rename(case_sequence = CaseSequence, month = `YYYY-MMM`, kwh = KWH) #column names --> consistent form
#view(power_data) #verify
#convert to time series
power_ts <- ts(power_data$kwh, start = c(1998,1), frequency = 12)
head(power_ts)
## Jan Feb Mar Apr May Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147
tail(power_ts)
## Jul Aug Sep Oct Nov Dec
## 2013 8415321 9080226 7968220 5759367 5769083 9606304
summary(power_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
We observe that the series runs from January of 1998 through December of 2013. We're dealing with 1 NA value and quite a range of kwh values, and residential power consumption appears to rise over time, although a plot would be most useful for verifying this. It does not appear that we have much of a missing data problem, although it may be interesting to locate where exactly the NA value is situated.
As a natural next step, we explore where exactly the NA value is situated and impute using the na_interpolation() function:
#specific indices of missing data
ggplot_na_distribution(power_ts)
#impute missing value
#power_ts #verify ts before
power_ts <- na_interpolation(power_ts)
Where there was an NA value for September of 2008 (before), we impute a "reasonable" 6569470 kwh value. I elected to impute rather than drop because dropping would have broken the regular monthly structure of the time series. Additionally, I elected the aforementioned function because it claims to specialize in univariate time series data (which we're using) and its documentation was clear / promising.
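As a hedged aside, we could confirm the imputed September 2008 value directly, and note that imputeTS offers other fill strategies beyond the linear default should we want them; a sketch:

#sketch: confirm the value imputed for September 2008
window(power_ts, start = c(2008, 9), end = c(2008, 9))
#alternatives to the linear default we could have tried instead:
#na_interpolation(power_ts, option = "spline") #cubic spline
#na_interpolation(power_ts, option = "stine")  #Stineman interpolation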
We use the ts_plot() function as a means of exploring our time series data with a different approach to plotting (plotly) than used earlier:
ts_plot(power_ts,
title = "Residential Power Consumption",
Xtitle = "Date",
Ytitle = "KWHs")
It appears that imputation worked, we have an extreme outlier in 2010, and our data is seasonal and trending up.
Outlier-handling will be important. Where it appears that we should have had a large positive number (~10mil KWH), instead we have our minimum value (~770k KWH).
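As a small sketch before replacing anything, we can pin down exactly where that suspect minimum sits:

#sketch: locate the suspect minimum observation
which.min(power_ts)                 #observation number of the minimum
time(power_ts)[which.min(power_ts)] #its position in fractional years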
I lean on the tsclean() function to replace this outlier with a more reasonable number and verify using ts_plot():
#remove and replace outlier
##replace outlier using tsclean()
power_ts_cleaned <- tsclean(power_ts, replace.missing = TRUE)
##verify
ts_plot(power_ts_cleaned,
title = "Residential Power Consumption - Cleaned",
Xtitle = "Date",
Ytitle = "KWHs")
With NA and outlier values handled, we proceed to data preparation and model-building.
To familiarize ourselves with the types of transformations that may be necessary for our data, we first decompose the series:
d_ptc <- decompose(power_ts_cleaned)
plot(d_ptc)
Our data has an upward trend and seasonality. To address the upward trend we can apply a box cox or log transformation, whereas to address the seasonality we can difference our data.
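As a hedged aside, the forecast package can also tell us how many seasonal differences the cleaned series would call for, and a lag-12 difference could be inspected directly; a sketch, though we only carry the Box-Cox transformation forward in the main workflow:

#sketch: seasonal differences suggested for the cleaned monthly series
nsdiffs(power_ts_cleaned)
#a single lag-12 (seasonal) difference could then be plotted, e.g.:
#autoplot(diff(power_ts_cleaned, lag = 12))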
We perform a box cox transformation and consult the resulting plot:
#-0.1460788
bc_ptc <- BoxCox(power_ts_cleaned, lambda=BoxCox.lambda(power_ts_cleaned))
autoplot(bc_ptc)
With a lambda of -0.1460788, it looks like the variance we’d earlier noted in our data has been stabilized.
Being that seasonal differencing (D=1) will be required, we explore whether non-seasonal differencing is also needed:
ndiffs(bc_ptc)
## [1] 1
It appears that non-seasonal differencing is required.
To observe the difference between an ETS (exponential smoothing) model and a seasonal ARIMA model (with orders suggested by auto ARIMA), we train/test split our data and use the MAPE (mean absolute percentage error) of the test set for model selection.
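For clarity, MAPE is the mean of the absolute percentage errors; a hand-rolled sketch of the metric (the helper name mape_pct is mine, not from a package) is:

#sketch: MAPE written out by hand, returned as a percentage
mape_pct <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}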
We train test split our data, build an ETS model, and output the model’s training statistics and test set MAPE:
#train test split time series - using 2013
split_power <- ts_split(power_ts_cleaned, sample.out = 12)

training <- split_power$train
testing <- split_power$test

#length(training)
#length(testing) #12

#ets model
ets_model <- ets(training, lambda=BoxCox.lambda(power_ts_cleaned))
summary(ets_model)
## ETS(A,N,A)
##
## Call:
## ets(y = training, lambda = BoxCox.lambda(power_ts_cleaned))
##
## Box-Cox transformation: lambda= -0.1461
##
## Smoothing parameters:
## alpha = 0.191
## gamma = 1e-04
##
## Initial states:
## l = 6.14
## s = -0.0059 -0.0281 -0.0125 0.0182 0.0264 0.0192
## 0.0026 -0.0237 -0.0188 -0.0068 0.0073 0.022
##
## sigma: 0.0092
##
## AIC AICc BIC
## -736.3954 -733.4686 -688.5011
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 64913.73 582164 439112.8 0.2910452 6.761764 0.7086627 0.2478782
ets_forecast <- forecast(ets_model, h=length(testing))
#abs(testing - ets_forecast$mean) / testing # verify
MAPE(ets_forecast$mean, testing) / 100
## [1] 0.9742289
If we consult the training set MAPE, we observe 6.76%. When we then consult the testing set MAPE, after dividing by 100, we get 0.974%. Less than 1% MAPE seems too good to be true and is indicative of a very strong predictive performance.
We proceed to the seasonal ARIMA model for comparison. We apply auto.arima() to provide the model orders (p,d,q)(P,D,Q), train our SARIMA model via sarima.for(), and consult the test set MAPE:
#auto.arima(training)
sarima_forecast <- sarima.for(training, n.ahead=length(testing), p=1, d=0, q=1, P=2, D=1, Q=0, S=12)
MAPE(sarima_forecast$pred, testing) / 100
## [1] 0.9962629
For the SARIMA model we get a testing MAPE of 0.996%. While on the surface this and the plot look quite good, it appears to be slightly worse than our exponential smoothing model.
I choose the exponential smoothing model to forecast with by this most incremental of margins. The exponential smoothing model had an ever-so-slightly smaller MAPE, which is indicative of lower error on the held-out data. While the margin between models was minuscule, I stand by the principle that forecasting against a verification set provides a strong vote of confidence in the performance of our elected model.
As a final step, we train our model on all the data, forecast for 2014, output the corresponding plot, and send the forecast to csv:
#retrain the model (accounting for 2013 data this time)
ets_fit <- ets(power_ts_cleaned, lambda=BoxCox.lambda(power_ts_cleaned))

#forecast
ets_fit %>% forecast(h=12) -> power_forecast
#output forecast plot
autoplot(power_forecast)
#list the forecast
power_forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 9411722 8324368 10664993 7807444 11405313
## Feb 2014 8089025 7168809 9147181 6730575 9771108
## Mar 2014 6689187 5943870 7543594 5588107 8046177
## Apr 2014 5912700 5261607 6657815 4950422 7095526
## May 2014 5634103 5014662 6342836 4718554 6759102
## Jun 2014 7427261 6575098 8408420 6169664 8987525
## Jul 2014 8696137 7671753 9880372 7185858 10581547
## Aug 2014 9236485 8134187 10513405 7612135 11270666
## Sep 2014 8193854 7227144 9311687 6768687 9973670
## Oct 2014 5981183 5301887 6762111 4978325 7222484
## Nov 2014 5625277 4988731 6356661 4685409 6787644
## Dec 2014 7070354 6213958 8064771 5809085 8655587
#generate date sequence for csv
power_forecast.dates <- yearmonth(seq(as.Date("2014/1/1"), by = "month", length.out = 12))

#generate df for csv
csv_df2 <- tibble(date = power_forecast.dates, kwh = power_forecast$mean)
#write to csv
write.csv(csv_df2,"C:/Users/magnu/Documents/DATA-624//DATA624Proj1PtB_power_forecast.csv", row.names = FALSE)
The plot above highlights an “in range” forecast. It appears that an exponential smoothing model’s greater weight on recent data works in this instance of power consumption.
While there is greater variability in the data prior to our forecast, our forecast captures the seasonality of what appears to be higher demand in the summer and winter while sitting between the peaks and troughs of more recent years.