DATA 624 Project 1

library(readxl)
library(fpp3)
library(ggfortify)
library(DataExplorer)
library(tidyverse)
library(gridExtra)
library(forecast)
library(dplyr)

Project 1 Description

This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday Mar 26. I will accept late submissions with a penalty until the meetup after that when we review some projects.

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

PART A

Data Exploration

We will first load the data, which I have saved in a local folder, and explore it. Reading in the DATA

ATMDATA <- read_excel("ATM624Data.xlsx")
head(ATMDATA)

## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

RCFL <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(RCFL)

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

As we can observe in the table above, the variable “DATE” is of data type “POSIXct”. We will have to convert it to date in order to work with the data.

summary(ATMDATA)

##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1474        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40118   Mode  :character   Median :   73.0  
##  Mean   :40118                      Mean   :  155.6  
##  3rd Qu.:40210                      3rd Qu.:  114.0  
##  Max.   :40312                      Max.   :10919.8  
##                                     NA's   :19

plot_missing(ATMDATA)

There are a few missing values for “ATM” but I believe these were left blank on purpose in order to enter the predictions later on for the month of May. Additionally, there are also a few missing values from “Cash” that we may need to address.

ATMDATA[!complete.cases(ATMDATA),]

## # A tibble: 19 × 3
##     DATE ATM    Cash
##    <dbl> <chr> <dbl>
##  1 39977 ATM1     NA
##  2 39980 ATM1     NA
##  3 39982 ATM2     NA
##  4 39986 ATM1     NA
##  5 39988 ATM2     NA
##  6 40299 <NA>     NA
##  7 40300 <NA>     NA
##  8 40301 <NA>     NA
##  9 40302 <NA>     NA
## 10 40303 <NA>     NA
## 11 40304 <NA>     NA
## 12 40305 <NA>     NA
## 13 40306 <NA>     NA
## 14 40307 <NA>     NA
## 15 40308 <NA>     NA
## 16 40309 <NA>     NA
## 17 40310 <NA>     NA
## 18 40311 <NA>     NA
## 19 40312 <NA>     NA

head(ATMDATA)

## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

#fix the date - as you saw previously we need to make some adjustment to fix data

ATMDATA$DATE<-as.Date(ATMDATA$DATE, origin = "1899-12-30")

Note from above that of 19 entries(rows) we see that 14 are missing. For simplicity sake these are supposed to be removed.

While still exploring the data we make a matrix of plox with a given data set

Data Preparation

We will create a separate data set in order to maintain the original data and make all the necessary transformations there. First we’ll transform the “DATE” variables into its corresponding data type date. We will also make “Cash” an integer as we know that ATMs do not dispense cents.

atm <- ATMDATA %>%
  mutate(DATE = as_date(DATE), Cash = as.integer(Cash))
str(atm)

## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: int [1:1474] 96 107 82 89 85 90 90 55 99 79 ...

Since we are looking at 4 separate ATMs and the data does not provide their location, they may have varying amounts of cash withdrawn at varying different days. We will work with each ATM separately and each will have its own forecast for the month of May in 2010.

atm1 <- atm %>%
  filter(ATM == "ATM1") %>%
  as_tsibble(index = DATE)


autoplot(atm1) +
  labs(title="ATM1", subtitle="Cash withdrawals per day", y="Hundreds USD")

atm2 <- atm %>%
  filter(ATM == "ATM2") %>%
  as_tsibble(index = DATE)


autoplot(atm2) +
  labs(title="ATM2", subtitle="Cash withdrawals per day", y="Hundreds USD")

atm3 <- atm %>%
  filter(ATM == "ATM3") %>%
  as_tsibble(index = DATE)


autoplot(atm3) +
  labs(title="ATM3", subtitle="Cash withdrawals per day", y="Hundreds USD")

atm4 <- atm %>%
  filter(ATM == "ATM4") %>%
  as_tsibble(index = DATE)


autoplot(atm4) +
  labs(title="ATM4", subtitle="Cash withdrawals per day", y="Hundreds USD")

Based on our plots above, we can observe that “ATM1” and “ATM2” are time series that have constant variability, with no apparent trend and potential seasonality, but we will explore these more in detail with decompositions. However, “ATM3”seems to only have withdrawals for the last few days in the data. Lastly, “ATM4” has a clear outlier.

which(atm4$Cash == 10919)

## [1] 285

The outlier is found on row 285 of the 4th ATM time series. We will use the average to impute this number as it is obvious that this is an error.

The time series looks more like “ATM1” and “ATM2” now:

atm4[285,3] <- round(mean(atm4$Cash),0)

autoplot(atm4) +
  labs(title="ATM4", subtitle="Cash withdrawals per day", y="Hundreds USD")

We also seem to have a few missing values on the “Cash” column for “ATM1” and “ATM2”:

sum(is.na(atm1$Cash))

## [1] 3

sum(is.na(atm2$Cash))

## [1] 2

sum(is.na(atm3$Cash))

## [1] 0

sum(is.na(atm4$Cash))

## [1] 0

The missing values are found on the rows below for each respective ATM

which(is.na(atm1$Cash))

## [1] 44 47 53

which(is.na(atm2$Cash))

## [1] 49 55

hist(atm1$Cash)

hist(atm2$Cash)

Since we don’t see an evident skewness on the distribution of “Cash” for either ATM above, I will use the median to impute these values:

atm1[44,3] <- round(median(atm1$Cash, na.rm=TRUE),0)
atm1[47,3] <- round(median(atm1$Cash, na.rm=TRUE),0)
atm1[53,3] <- round(median(atm1$Cash, na.rm=TRUE),0)
atm2[49,3] <- round(median(atm2$Cash, na.rm=TRUE),0)
atm2[55,3] <- round(median(atm2$Cash, na.rm=TRUE),0)

Additionally, we can look for a Box-Cox transformation for each series to help make them a little simpler:

lambda1 <- atm1%>%
  features(Cash,features = guerrero)%>%
  pull(lambda_guerrero)

plot_trans1 <- atm1 %>%
  autoplot(box_cox(Cash, lambda1)) +
  labs(title="ATM1 TRANSFORMED", subtitle="Cash withdrawals per day", y="USD")

lambda2 <- atm2%>%
  features(Cash,features = guerrero)%>%
  pull(lambda_guerrero)

plot_trans2 <- atm2 %>%
  autoplot(box_cox(Cash, lambda2)) +
  labs(title="ATM2 TRANSFORMED", subtitle="Cash withdrawals per day", y="USD")

lambda3 <- atm3%>%
  features(Cash,features = guerrero)%>%
  pull(lambda_guerrero)

plot_trans3 <- atm3 %>%
  autoplot(box_cox(Cash, lambda3)) +
  labs(title="ATM3 TRANSFORMED", subtitle="Cash withdrawals per day", y="USD")

lambda4 <- atm4%>%
  features(Cash,features = guerrero)%>%
  pull(lambda_guerrero)

plot_trans4 <- atm4 %>%
  autoplot(box_cox(Cash, lambda4)) +
  labs(title="ATM4 TRANSFORMED", subtitle="Cash withdrawals per day", y="USD")

grid.arrange(plot_trans1, plot_trans2, plot_trans3, plot_trans4, nrow = 2)

The transformations helped scale down the time series of all ATMs.

Build Model

Let’s now look at the decomposition of each series to see if we have strong seasonality and perhaps differencing is required in the model. Since the magnitud of the seasonal components do not seem to change with time, we can say the series are additive.

atm1%>%
  model(classical_decomposition(box_cox(Cash, lambda1), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM1")

atm2%>%
  model(classical_decomposition(box_cox(Cash, lambda2), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM2")

atm3%>%
  model(classical_decomposition(box_cox(Cash, lambda3), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM3")

atm4%>%
  model(classical_decomposition(box_cox(Cash, lambda4), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM4")

As evident on the plots above, we see a strong seasonal component for all ATMs.

plot_acf1 <- atm1 %>%
  ACF(box_cox(Cash, lambda1)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM1")

plot_acf2 <- atm2 %>%
  ACF(box_cox(Cash, lambda2)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM2")

plot_acf3 <- atm3 %>%
  ACF(box_cox(Cash, lambda3)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM3")

plot_acf4 <- atm4 %>%
  ACF(box_cox(Cash, lambda4)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM4")

grid.arrange(plot_acf1, plot_acf2, plot_acf3, plot_acf4, nrow = 2)

As observed in the ACF plots above, we may need to apply unitroot_nsdiffs() to the daily cash withdrawals for each ATM in order to determine if we need any seasonal differencing by week.

atm1 %>%
  features(box_cox(Cash, lambda1), unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

atm2 %>%
  features(box_cox(Cash, lambda2), unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

atm3 %>%
  features(box_cox(Cash, lambda3), unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

atm4 %>%
  features(box_cox(Cash, lambda4), unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

As determined by the function above, we need to apply seasonal differencing to “ATM1” and “ATM2”. Let’s explore further to see if we need any additional differencing:

atm1 %>%
  features(difference(box_cox(Cash, lambda1), 7), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

atm2 %>%
  features(difference(box_cox(Cash, lambda2), 7), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

atm3 %>%
  features(box_cox(Cash, lambda3), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

atm4 %>%
  features(box_cox(Cash, lambda4), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

No additional differencing seems to be needed. Let’s take a look at the ACF plots aftering differencing “ATM1” and “ATM2”.

atm1 %>%
  ACF(difference(box_cox(Cash, lambda1), 7)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM1")

atm2 %>%
  ACF(difference(box_cox(Cash, lambda2), 7)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM2")

Differencing seems to have made the data look closer to white noise.

Considering that these data need differencing, we will use the ARIMA() model, which applies differencing within the algorithm, making it simpler to build.

For “ATM3” we will use the naive model, which takes the last observation to forecast. Given there are only three values, this is a sound approach.

atm1_fit <- atm1 %>%
  model(ARIMA(box_cox(Cash, lambda1)))

report(atm1_fit)

## Series: Cash 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda1) 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.0994  -0.1141  -0.6456
## s.e.  0.0524   0.0527   0.0426
## 
## sigma^2 estimated as 1.576:  log likelihood=-589.77
## AIC=1187.54   AICc=1187.66   BIC=1203.07

atm1_fit %>%
  gg_tsresiduals()

atm2_fit <- atm2 %>%
  model(ARIMA(box_cox(Cash, lambda2)))

report(atm2_fit)

## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda2) 
## 
## Coefficients:
##           ar1      ar2    ma1     ma2     sma1
##       -0.4247  -0.9074  0.468  0.8004  -0.7262
## s.e.   0.0587   0.0438  0.089  0.0584   0.0409
## 
## sigma^2 estimated as 69.93:  log likelihood=-1267.92
## AIC=2547.85   AICc=2548.09   BIC=2571.13

atm2_fit %>%
  gg_tsresiduals()

atm3_fit <- atm3 %>%
  model(NAIVE(box_cox(Cash, lambda3)))

report(atm3_fit)

## Series: Cash 
## Model: NAIVE 
## Transformation: box_cox(Cash, lambda3) 
## 
## sigma^2: 396.8997

atm3_fit %>%
  gg_tsresiduals()

atm4_fit <- atm4 %>%
  model(ARIMA(box_cox(Cash, lambda4)))

report(atm4_fit)

## Series: Cash 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## Transformation: box_cox(Cash, lambda4) 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0761  0.2091  0.1986   17.3297
## s.e.  0.0529  0.0517  0.0525    0.7524
## 
## sigma^2 estimated as 187.5:  log likelihood=-1471.55
## AIC=2953.1   AICc=2953.27   BIC=2972.6

atm4_fit %>%
  gg_tsresiduals()

All residuals, except for “ATM3”, look like they have constant variability, seem to be white noise and have approximately close to normal distributions.

Forecasts

Finally, we save the forecast of each ATM in its own .csv document:

forecast_atm1 <- atm1_fit %>% forecast(h=31)
forecast_atm2 <- atm2_fit %>% forecast(h=31)
forecast_atm3 <- atm3_fit %>% forecast(h=31)
forecast_atm4 <- atm4_fit %>% forecast(h=31)

write.csv(forecast_atm1,"forecast_atm1.csv")
write.csv(forecast_atm2,"forecast_atm2.csv")
write.csv(forecast_atm3,"forecast_atm3.csv")
write.csv(forecast_atm4,"forecast_atm4.csv")

PART B

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consisx of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

head(RCFL)

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

summary(RCFL)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

LOCATING THE MISSING VALUES

which(is.na(RCFL), arr.ind=TRUE)

##      row col
## [1,] 129   3

# aka 861 - 2008-Sep

slice(RCFL,c(127:132))

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          859 2008-Jul   7643987
## 2          860 2008-Aug   8037137
## 3          861 2008-Sep        NA
## 4          862 2008-Oct   5101803
## 5          863 2008-Nov   4555602
## 6          864 2008-Dec   6442746

#HERE YOU CAN WITNESS THE MISSING (NA) DATA

There is missing data for 2008 SEPT

There seems to be data points near 2010

RCFL_data <- RCFL %>% rename(Date = 'YYYY-MMM')
RCFL_data <- RCFL_data %>%  mutate(Date = as.Date(paste0('01-', Date), '%d-%Y-%b'))
min(RCFL_data$KWH,na.rm = TRUE)

## [1] 770523

RCFL_2 <-ts(RCFL[, "KWH"], start = c(1998, 1), frequency = 12)
ggseasonplot(RCFL_2)+ggtitle('USAGE BY YEAR FOR RESIDENTIAL POWER')

BEING THAT THE DATE APPEARS SEASONAL, I THNINK WE COULD USE MEAN VALUE OF THE MONTHS JUNE / NOV IN ORDER TO HANDLE MISSING

RCFL_data<- RCFL_data[-c(129,151),]

#Get average by month
RCFL_data$Month <- months(RCFL_data$Date)
aggregate(KWH ~ Month, RCFL_data, mean)

##        Month     KWH
## 1      April 5299734
## 2     August 8298211
## 3   December 6283175
## 4   February 6946556
## 5    January 8007422
## 6       July 7852521
## 7       June 6536092
## 8      March 5971450
## 9        May 5039034
## 10  November 4953619
## 11   October 5657164
## 12 September 7702333

RCFL$KWH[is.na(RCFL$KWH)] = median(RCFL$KWH, na.rm=TRUE)

summary(RCFL)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5434539  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6501333  
##  3rd Qu.:876.2                      3rd Qu.: 7608792  
##  Max.   :924.0                      Max.   :10655730

RCFL_ts <- ts(RCFL$KWH, start=c(1998,1), frequency = 12)
RCFL_ts

##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1998  6862583  5838198  5420658  5010364  4665377  6467147  8914755  8607428
## 1999  7183759  5759262  4847656  5306592  4426794  5500901  7444416  7564391
## 2000  7068296  5876083  4807961  4873080  5050891  7092865  6862662  7517830
## 2001  7538529  6602448  5779180  4835210  4787904  6283324  7855129  8450717
## 2002  7099063  6413429  5839514  5371604  5439166  5850383  7039702  8058748
## 2003  7256079  6190517  6120626  4885643  5296096  6051571  6900676  8476499
## 2004  7584596  6560742  6526586  4831688  4878262  6421614  7307931  7309774
## 2005  8225477  6564338  5581725  5563071  4453983  5900212  8337998  7786659
## 2006  7793358  5914945  5819734  5255988  4740588  7052275  7945564  8241110
## 2007  8031295  7928337  6443170  4841979  4862847  5022647  6426220  7447146
## 2008  7964293  7597060  6085644  5352359  4608528  6548439  7643987  8037137
## 2009  8072330  6976800  5691452  5531616  5264439  5804433  7713260  8350517
## 2010  9397357  8390677  7347915  5776131  4919289  6696292   770523  7922701
## 2011  8394747  8898062  6356903  5685227  5506308  8037779 10093343 10308076
## 2012  8991267  7952204  6356961  5569828  5783598  7926956  8886851  9612423
## 2013 10655730  7681798  6517514  6105359  5940475  7920627  8415321  9080226
##           Sep      Oct      Nov      Dec
## 1998  6989888  6345620  4640410  4693479
## 1999  7899368  5358314  4436269  4419229
## 2000  8912169  5844352  5041769  6220334
## 2001  7112069  5242535  4461979  5240995
## 2002  8245227  5865014  4908979  5779958
## 2003  7791791  5344613  4913707  5756193
## 2004  6690366  5444948  4824940  5791208
## 2005  7057213  6694523  4313019  6181548
## 2006  7296355  5104799  4458429  6226214
## 2007  7666970  5785964  4907057  6047292
## 2008  6283324  5101803  4555602  6442746
## 2009  7583146  5566075  5339890  7089880
## 2010  7819472  5875917  4800733  6152583
## 2011  8943599  5603920  6154138  8273142
## 2012  7559148  5576852  5731899  6609694
## 2013  7968220  5759367  5769083  9606304

ggtsdisplay(RCFL_ts, main="Monthly Power Consumption before transform")

BOXCOX TRANSFORM

RCFLS_BXCX <- RCFL_ts %>% BoxCox(lambda= 'auto')
ggtsdisplay(RCFLS_BXCX, main='MONTHLY POWER CONSUMER BXCX')

Exploratory data analysis

ggseasonplot(RCFLS_BXCX)

summary(RCFLS_BXCX)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.80   38.69   39.42   39.47   40.41   42.20

ggsubseriesplot(RCFLS_BXCX)

ggAcf(RCFLS_BXCX)

LETS UTILIZE A BOX TEST TO TAKE A CLOSER LOOK

Box.test(RCFLS_BXCX, type = c("Ljung-Box"))

## 
##  Box-Ljung test
## 
## data:  RCFLS_BXCX
## X-squared = 16.556, df = 1, p-value = 4.723e-05

summary(RCFLS_BXCX)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.80   38.69   39.42   39.47   40.41   42.20

boxplot(RCFLS_BXCX~cycle(RCFLS_BXCX))

DIFFERENCING

print(paste0("Suggested # of diff: ", ndiffs(RCFLS_BXCX)))

## [1] "Suggested # of diff: 1"

print(paste0("DIFF REQUIRED (SEASIONAL): ", ndiffs(diff(RCFLS_BXCX, lag=12))))

## [1] "DIFF REQUIRED (SEASIONAL): 0"

RCFL_PWR_DIFF <- RCFLS_BXCX %>% diff(lag = 12)
ggtsdisplay(RCFL_PWR_DIFF, main= "Monthly power consumption BXCX AND DIFF")

LETS SEE A GRAPHIC FOR RES POWER USAGE BY YEAR

ggseasonplot(RCFL_PWR_DIFF,polar = TRUE)+ggtitle('Residential Power Usage by Year')

plot(RCFL_PWR_DIFF)

###LET SEE A MOVING AVG

autoplot(RCFL_PWR_DIFF, series="Data")+
  autolayer(ma(RCFL_PWR_DIFF, 12), series = "12 MTH Moving Avg")+ ggtitle("2014 MVING AVG")

## Warning: Removed 12 rows containing missing values (`geom_line()`).

Forecast Models

1 STL - ANN NO DP

#stlf - etsmodel
RCFLS_STL <- stlf(RCFL_PWR_DIFF, damped=FALSE, s.window = "periodic", robust=TRUE, h = 12)

# forecast plot
autoplot(RCFLS_STL) + autolayer(fitted(RCFLS_STL))

###2 STL - DP AADN

#stlf - etsmodel estimation --- M, Ad, N is chosen.
RCFL_STL_DP <- stlf(RCFL_PWR_DIFF, damped=TRUE, s.window = "periodic", robust=TRUE, h = 12)

# forecast plot
autoplot(RCFL_STL_DP) + autolayer(fitted(RCFL_STL_DP))

###3 - ARIMA

# auto.arima
arima_model <- auto.arima(RCFL_PWR_DIFF)

# forecast values
arima_model <- forecast(arima_model, h=20)

# forecast plot
autoplot(arima_model) + autolayer(fitted(arima_model))

###4 - ETS MNM

RCFL_ETS<- ets(RCFL_PWR_DIFF)

# forecast plot
autoplot(forecast(RCFL_ETS
                , h=12)) + autolayer(fitted(RCFL_ETS
                                                    ))

5 EXP SMOOTH

RCFL_FCST_PWR_S <- ses(RCFL_PWR_DIFF, h=12)
autoplot(RCFL_FCST_PWR_S)+
  autolayer(fitted(RCFL_FCST_PWR_S), series="Fitted")

COMPARISON OF THE MODELS

accuracy(RCFLS_STL)

##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0005455413 1.348221 0.5931651 -1360.432 1605.179 0.5404898
##                   ACF1
## Training set 0.1194029

checkresiduals(RCFLS_STL)

## Warning in checkresiduals(RCFLS_STL): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 65.794, df = 22, p-value = 2.984e-06
## 
## Model df: 2.   Total lags used: 24

summary(RCFLS_STL)

## 
## Forecast method: STL +  ETS(A,N,N)
## 
## Model Information:
## ETS(A,N,N) 
## 
## Call:
##  ets(y = na.interp(x), model = etsmodel, damped = FALSE, allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0713 
## 
##   sigma:  1.3558
## 
##      AIC     AICc      BIC 
## 1048.295 1048.432 1057.874 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0005455413 1.348221 0.5931651 -1360.432 1605.179 0.5404898
##                   ACF1
## Training set 0.1194029
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.17305758 -1.564437 1.910552 -2.484211 2.830326
## Feb 2014     0.06130592 -1.676189 1.798801 -2.595963 2.718575
## Mar 2014     0.12236178 -1.615133 1.859857 -2.534907 2.779631
## Apr 2014     0.08210098 -1.655394 1.819596 -2.575168 2.739370
## May 2014     0.12746350 -1.610031 1.864958 -2.529805 2.784732
## Jun 2014     0.14679543 -1.590699 1.884290 -2.510474 2.804064
## Jul 2014    -0.19208644 -1.929581 1.545408 -2.849355 2.465183
## Aug 2014    -0.02245609 -1.759951 1.715039 -2.679725 2.634813
## Sep 2014     0.16034813 -1.577147 1.897843 -2.496921 2.817617
## Oct 2014     0.05745299 -1.680042 1.794948 -2.599816 2.714722
## Nov 2014     0.08070004 -1.656795 1.818195 -2.576569 2.737969
## Dec 2014     0.05898262 -1.678512 1.796477 -2.598286 2.716252

accuracy(RCFL_STL_DP)

##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set -0.007629573 1.349456 0.5947813 -1328.026 1577.668 0.5419625
##                  ACF1
## Training set 0.120852

checkresiduals(RCFL_STL_DP)

## Warning in checkresiduals(RCFL_STL_DP): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,Ad,N)
## Q* = 65.735, df = 19, p-value = 4.636e-07
## 
## Model df: 5.   Total lags used: 24

summary(RCFL_STL_DP)

## 
## Forecast method: STL +  ETS(A,Ad,N)
## 
## Model Information:
## ETS(A,Ad,N) 
## 
## Call:
##  ets(y = na.interp(x), model = etsmodel, damped = TRUE, allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     beta  = 1e-04 
##     phi   = 0.9622 
## 
##   Initial states:
##     l = 0.1056 
##     b = -6e-04 
## 
##   sigma:  1.3686
## 
##      AIC     AICc      BIC 
## 1054.625 1055.110 1073.783 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set -0.007629573 1.349456 0.5947813 -1328.026 1577.668 0.5419625
##                  ACF1
## Training set 0.120852
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.17900301 -1.574926 1.932932 -2.503400 2.861406
## Feb 2014     0.06761419 -1.686315 1.821544 -2.614789 2.750018
## Mar 2014     0.12901915 -1.624910 1.882949 -2.553385 2.811423
## Apr 2014     0.08909425 -1.664835 1.843024 -2.593310 2.771498
## May 2014     0.13477996 -1.619150 1.888710 -2.547624 2.817184
## Jun 2014     0.15442285 -1.599507 1.908353 -2.527982 2.836828
## Jul 2014    -0.18415983 -1.938091 1.569771 -2.866565 2.498246
## Aug 2014    -0.01424159 -1.768173 1.739690 -2.696648 2.668165
## Sep 2014     0.16883961 -1.585092 1.922772 -2.513568 2.851247
## Oct 2014     0.06621097 -1.687722 1.820144 -2.616198 2.748620
## Nov 2014     0.08971444 -1.664219 1.843648 -2.592696 2.772125
## Dec 2014     0.06824374 -1.685691 1.822179 -2.614168 2.750656

accuracy(arima_model)

##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02554754 0.9666583 0.4541852 -1920.696 2160.894 0.4138519
##                     ACF1
## Training set 0.006469743

checkresiduals(arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
## Q* = 6.8185, df = 21, p-value = 0.9985
## 
## Model df: 3.   Total lags used: 24

summary(arima_model)

## 
## Forecast method: ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
## 
## Model Information:
## Series: RCFL_PWR_DIFF 
## ARIMA(0,0,1)(0,0,2)[12] with non-zero mean 
## 
## Coefficients:
##          ma1     sma1    sma2    mean
##       0.1195  -0.9264  0.0890  0.0616
## s.e.  0.0712   0.0810  0.0811  0.0198
## 
## sigma^2 = 0.9557:  log likelihood = -257.11
## AIC=524.23   AICc=524.57   BIC=540.19
## 
## Error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02554754 0.9666583 0.4541852 -1920.696 2160.894 0.4138519
##                     ACF1
## Training set 0.006469743
## 
## Forecasts:
##          Point Forecast     Lo 80      Hi 80     Lo 95     Hi 95
## Jan 2014    -0.71170867 -1.964897  0.5414796 -2.628295 1.2048774
## Feb 2014     0.20176196 -1.060338  1.4638615 -1.728453 2.1319766
## Mar 2014     0.10559912 -1.156500  1.3676986 -1.824616 2.0358137
## Apr 2014    -0.21986269 -1.481962  1.0422368 -2.150077 1.7103519
## May 2014    -0.33880539 -1.600905  0.9232941 -2.269020 1.5914092
## Jun 2014    -0.43704168 -1.699141  0.8250578 -2.367256 1.4931729
## Jul 2014    -1.07257955 -2.334679  0.1895200 -3.002794 0.8576351
## Aug 2014     0.06430436 -1.197795  1.3264039 -1.865910 1.9945190
## Sep 2014     0.14938779 -1.112712  1.4114873 -1.780827 2.0796024
## Oct 2014     0.22911025 -1.032989  1.4912097 -1.701104 2.1593249
## Nov 2014    -0.19797910 -1.460079  1.0641204 -2.128194 1.7322355
## Dec 2014    -1.54532103 -2.807419 -0.2832232 -3.475533 0.3848910
## Jan 2015    -0.02932622 -1.743669  1.6850167 -2.651187 2.5925349
## Feb 2015     0.05006537 -1.669873  1.7700036 -2.580353 2.6804839
## Mar 2015     0.05625769 -1.663681  1.7761959 -2.574161 2.6866762
## Apr 2015     0.08773329 -1.632205  1.8076715 -2.542685 2.7181518
## May 2015     0.10404245 -1.615896  1.8239807 -2.526376 2.7344610
## Jun 2015     0.11673378 -1.603204  1.8366720 -2.513685 2.7471523
## Jul 2015     0.18863975 -1.531298  1.9085780 -2.441779 2.8190583
## Aug 2015     0.06605819 -1.653880  1.7859964 -2.564360 2.6964767

accuracy(RCFL_ETS)

##                         ME     RMSE       MAE       MPE     MAPE     MASE
## Training set -1.229798e-05 1.346981 0.6002461 -768.1263 994.3076 0.546942
##                   ACF1
## Training set 0.1202904

checkresiduals(RCFL_ETS)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 66.674, df = 22, p-value = 2.184e-06
## 
## Model df: 2.   Total lags used: 24

summary(RCFL_ETS)

## ETS(A,N,N) 
## 
## Call:
##  ets(y = RCFL_PWR_DIFF) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0719 
## 
##   sigma:  1.3545
## 
##      AIC     AICc      BIC 
## 1047.964 1048.100 1057.543 
## 
## Training set error measures:
##                         ME     RMSE       MAE       MPE     MAPE     MASE
## Training set -1.229798e-05 1.346981 0.6002461 -768.1263 994.3076 0.546942
##                   ACF1
## Training set 0.1202904

accuracy(RCFL_FCST_PWR_S)

##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0002402688 1.346981 0.6002376 -765.1024 991.1566 0.5469343
##                   ACF1
## Training set 0.1202903

checkresiduals(RCFL_FCST_PWR_S)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 66.674, df = 22, p-value = 2.184e-06
## 
## Model df: 2.   Total lags used: 24

summary(RCFL_FCST_PWR_S)

## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = RCFL_PWR_DIFF, h = 12) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0716 
## 
##   sigma:  1.3545
## 
##      AIC     AICc      BIC 
## 1047.964 1048.100 1057.543 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0002402688 1.346981 0.6002376 -765.1024 991.1566 0.5469343
##                   ACF1
## Training set 0.1202903
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Feb 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Mar 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Apr 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## May 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Jun 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Jul 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Aug 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Sep 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Oct 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726455
## Nov 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726455
## Dec 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726455

If you look at ARIMA it based AIC it appears with best result. BIC dropped to 540. AIC it dropped to 524. RMSE has also dropped from 1.347 to 0.966. i think i’ll take ARIMA model on this one. I’ll go ahead and predict the values in csv as I am comfortable with the results of ARIMA.

rslts_2 <- forecast(arima_model, h=12)
rslts_fin <- data.frame(rslts_2)

write.csv(rslts_fin,"results_rcfl.csv", row.names = FALSE)