Project_1_Data

R Markdown

library(readxl)
library(fpp3)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.5
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.3.2
## ✔ lubridate   1.9.3     ✔ fable       0.3.4
## ✔ ggplot2     3.5.0     ✔ fabletools  0.4.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(ggfortify)
library(DataExplorer)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ readr   2.1.5
## ✔ purrr   1.0.2     ✔ stringr 1.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## Registered S3 methods overwritten by 'forecast':
##   method                 from     
##   autoplot.Arima         ggfortify
##   autoplot.acf           ggfortify
##   autoplot.ar            ggfortify
##   autoplot.bats          ggfortify
##   autoplot.decomposed.ts ggfortify
##   autoplot.ets           ggfortify
##   autoplot.forecast      ggfortify
##   autoplot.stl           ggfortify
##   autoplot.ts            ggfortify
##   fitted.ar              ggfortify
##   fortify.ts             ggfortify
##   residuals.ar           ggfortify

library(dplyr)
library(writexl)

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Part A

Data Exploration

Firstly, we’ll want to load in the data and explore it. Initial I was going to place the files in github however since I would need to convert them to csv, I was a bit worried about issues with the formating (ex. time and date) so instead the files used will be in my local drive.

#part a dataset
ATMDATA <- read_excel("ATM624Data.xlsx")
head(ATMDATA)

## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

#part b dataset
RCFL <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(RCFL)

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

From looking at the data, we can see that the “DATE” variable is a data type called “POSIXct”. This will need to be converted to data most likely.

summary(ATMDATA)

##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1474        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40118   Mode  :character   Median :   73.0  
##  Mean   :40118                      Mean   :  155.6  
##  3rd Qu.:40210                      3rd Qu.:  114.0  
##  Max.   :40312                      Max.   :10919.8  
##                                     NA's   :19

plot_missing(ATMDATA)

There are a few missing values for “ATM” but I believe these were left blank on purpose in order to enter the predictions later on for the month of May. Additionally, there are also a few missing values from “Cash” that we may need to address.

ATMDATA[!complete.cases(ATMDATA),]

## # A tibble: 19 × 3
##     DATE ATM    Cash
##    <dbl> <chr> <dbl>
##  1 39977 ATM1     NA
##  2 39980 ATM1     NA
##  3 39982 ATM2     NA
##  4 39986 ATM1     NA
##  5 39988 ATM2     NA
##  6 40299 <NA>     NA
##  7 40300 <NA>     NA
##  8 40301 <NA>     NA
##  9 40302 <NA>     NA
## 10 40303 <NA>     NA
## 11 40304 <NA>     NA
## 12 40305 <NA>     NA
## 13 40306 <NA>     NA
## 14 40307 <NA>     NA
## 15 40308 <NA>     NA
## 16 40309 <NA>     NA
## 17 40310 <NA>     NA
## 18 40311 <NA>     NA
## 19 40312 <NA>     NA

head(ATMDATA)

## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

#fix the date
ATMDATA$DATE<-as.Date(ATMDATA$DATE, origin = "1899-12-30")

Of 19 entries(rows), we see that 14 are missing. For simplicity sake these will be removed.

Data Preparation

We will create a separate data set in order to maintain the original data and make all the necessary transformations. First we’ll transform the “DATE” variables into its corresponding data type date. We will also make “Cash” an integer as we know that ATMs only dispense at the dollar amount.

atm <- ATMDATA %>%
  mutate(DATE = as_date(DATE), Cash = as.integer(Cash))
str(atm)

## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: int [1:1474] 96 107 82 89 85 90 90 55 99 79 ...

atm1 <- atm %>%
  filter(ATM == "ATM1") %>%
  as_tsibble(index = DATE)


autoplot(atm1) +
  labs(title="ATM1", subtitle="Cash withdrawals per day", y="Hundreds USD")

## Plot variable not specified, automatically selected `.vars = Cash`

Since we are looking at 4 separate ATMs and the data has no location info, they may have varying amounts of cash withdrawn at varying different days. We’ll look at each ATM separately and each will have its own forecast for the month of May in 2010.

#ATM 1
atm_1 <- atm %>%
  filter(ATM == "ATM1") %>%
  as_tsibble(index = DATE)


autoplot(atm_1) +
  labs(title="ATM1", subtitle="Daily Cash Withdrawal", y="USD (Hundreds)")

## Plot variable not specified, automatically selected `.vars = Cash`

#ATM 2
atm_2 <- atm %>%
  filter(ATM == "ATM2") %>%
  as_tsibble(index = DATE)


autoplot(atm_2) +
  labs(title="ATM2", subtitle="Daily Cash Withdrawal", y="USD (Hundreds)")

## Plot variable not specified, automatically selected `.vars = Cash`

#ATM 3
atm_3 <- atm %>%
  filter(ATM == "ATM3") %>%
  as_tsibble(index = DATE)


autoplot(atm_3) +
  labs(title="ATM3", subtitle="Daily Cash Withdrawal", y="USD (Hundreds)")

## Plot variable not specified, automatically selected `.vars = Cash`

#ATM 4
atm_4 <- atm %>%
  filter(ATM == "ATM4") %>%
  as_tsibble(index = DATE)


autoplot(atm_4) +
  labs(title="ATM4", subtitle="Daily Cash Withdrawal", y="USD (Hundreds)")

## Plot variable not specified, automatically selected `.vars = Cash`

Based on the four plots above, we can observe that ATM 1 and ATM 2 are time series that have constant variability over the course of the two years, with no apparent trend and potential seasonality, but we will explore these more in detail with decomposition. However, ATM 3 seems to only have withdrawals for the last few days in the data as anything before the spike was a flat line. Finally, ATM 4” has a very visible outlier, where one particulate time the day’s withdrawal was nearly 3x higher than any other day.

which(atm_4$Cash == 10919)

## [1] 285

The outlier is located on row 285 of the dataset for the 4th ATM time series. We can use the average to impute the number as it looks to be an error.

atm_4[285,3] <- round(mean(atm_4$Cash),0)

autoplot(atm_4) +
  labs(title="ATM4", subtitle="Daily Cash Withdrawal", y="USD (Hundreds)")

## Plot variable not specified, automatically selected `.vars = Cash`

After adjustments, the time series will look similar to the first two ATM time series.

For ATM 1 and ATM 2 there seems to be a few missing values based from below:

#checking datasets for NAs
sum(is.na(atm_1$Cash))

## [1] 3

sum(is.na(atm_2$Cash))

## [1] 2

sum(is.na(atm_3$Cash))

## [1] 0

sum(is.na(atm_4$Cash))

## [1] 0

which(is.na(atm_1$Cash))

## [1] 44 47 53

which(is.na(atm_2$Cash))

## [1] 49 55

The missing values are seen in rows 44,47, and 53 for ATM 1 and rows 49 and 55 for ATM 2.

hist(atm_1$Cash)

hist(atm_2$Cash)

From the histograms we can see that there’s no evident skewness on the “cash” distribution for both ATMs, so I will use the media to impute the values to get a better adjusted value.

atm_1[44,3] <- round(median(atm_1$Cash, na.rm=TRUE),0)
atm_1[47,3] <- round(median(atm_1$Cash, na.rm=TRUE),0)
atm_1[53,3] <- round(median(atm_1$Cash, na.rm=TRUE),0)
atm_2[49,3] <- round(median(atm_2$Cash, na.rm=TRUE),0)
atm_2[55,3] <- round(median(atm_2$Cash, na.rm=TRUE),0)

Next, we’ll look for a Box-Cox transformation for each series as it will help scale down the time series for all the ATMs:

lambda1 <- atm_1%>%
  features(Cash,features = guerrero)%>%
  pull(lambda_guerrero)

plot_trans1 <- atm_1 %>%
  autoplot(box_cox(Cash, lambda1)) +
  labs(title="ATM 1 Transformed", subtitle="Daily Cash Withdrawal", y="USD")

lambda2 <- atm_2%>%
  features(Cash,features = guerrero)%>%
  pull(lambda_guerrero)

plot_trans2 <- atm_2 %>%
  autoplot(box_cox(Cash, lambda2)) +
  labs(title="ATM 2 Transformed", subtitle="Daily Cash Withdrawal", y="USD")

lambda3 <- atm_3%>%
  features(Cash,features = guerrero)%>%
  pull(lambda_guerrero)

## Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
## max(.period, : NA/Inf replaced by maximum positive value

## Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
## max(.period, : NA/Inf replaced by maximum positive value

plot_trans3 <- atm_3 %>%
  autoplot(box_cox(Cash, lambda3)) +
  labs(title="ATM 3 Transformed", subtitle="Daily Cash Withdrawal", y="USD")

lambda4 <- atm_4%>%
  features(Cash,features = guerrero)%>%
  pull(lambda_guerrero)

plot_trans4 <- atm_4 %>%
  autoplot(box_cox(Cash, lambda4)) +
  labs(title="ATM 4 Transformed", subtitle="Daily Cash Withdrawal", y="USD")

grid.arrange(plot_trans1, plot_trans2, plot_trans3, plot_trans4, nrow = 2)

Build Model

Next, we will look at the decomposiion of each series to see if there is strong seasonality and if differencing will be required for the model. Because the magnitude of the seasonal components do not change with time, it can be assumed that the series is additive.

atm_1%>%
  model(classical_decomposition(box_cox(Cash, lambda1), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM 1")

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

atm_2%>%
  model(classical_decomposition(box_cox(Cash, lambda2), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM 2")

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

atm_3%>%
  model(classical_decomposition(box_cox(Cash, lambda3), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM 3")

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

atm_4%>%
  model(classical_decomposition(box_cox(Cash, lambda4), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM 4")

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

Based on the four ATM plots, there is a strong seasonal component for each of the ATMs.Next let’s look at creating autocorrelation plots for each ATM:

plot_acf_1 <- atm_1 %>%
  ACF(box_cox(Cash, lambda1)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM 1")

plot_acf_2 <- atm_2 %>%
  ACF(box_cox(Cash, lambda2)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM 2")

plot_acf_3 <- atm_3 %>%
  ACF(box_cox(Cash, lambda3)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM 3")

plot_acf_4 <- atm_4 %>%
  ACF(box_cox(Cash, lambda4)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM 4")

grid.arrange(plot_acf_1, plot_acf_2, plot_acf_3, plot_acf_4, nrow = 2)

As observed in the ACF plots above, we may need to apply unitroot_nsdiffs() to the daily cash withdrawals for each ATM in order to determine if we need any seasonal differencing by week.

atm_1 %>%
  features(box_cox(Cash, lambda1), unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

atm_2 %>%
  features(box_cox(Cash, lambda2), unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

atm_3 %>%
  features(box_cox(Cash, lambda3), unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

atm_4 %>%
  features(box_cox(Cash, lambda4), unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

As determined by the function above, seasonal differencing is needed on ATM 1 and ATM 2. Let’s explore further to see if we need any additional differencing:

#No additional differencing seems to be needed.
atm_1 %>%
  features(difference(box_cox(Cash, lambda1), 7), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

atm_2 %>%
  features(difference(box_cox(Cash, lambda2), 7), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

atm_3 %>%
  features(box_cox(Cash, lambda3), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

atm_4 %>%
  features(box_cox(Cash, lambda4), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

Let’s take a look at the ACF plots post differencing:

atm_1 %>%
  ACF(difference(box_cox(Cash, lambda1), 7)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM 1")

atm_2 %>%
  ACF(difference(box_cox(Cash, lambda2), 7)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM 2")

Difference seems to have made the data look closer to white noise as less spikes are past the boundaries.

Considering that the data needed differencing, we will use the ARIMA() model, which applies differencing. For ATM 3 we will use a naive model that takes the last observation to forecast. As there are just three values, this seems to be the most appropriate method.

atm_1_fit <- atm_1 %>%
  model(ARIMA(box_cox(Cash, lambda1)))

report(atm_1_fit)

## Series: Cash 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda1) 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.0994  -0.1141  -0.6456
## s.e.  0.0524   0.0527   0.0426
## 
## sigma^2 estimated as 1.576:  log likelihood=-589.77
## AIC=1187.54   AICc=1187.66   BIC=1203.07

atm_1_fit %>%
  gg_tsresiduals()

atm_2_fit <- atm_2 %>%
  model(ARIMA(box_cox(Cash, lambda2)))

report(atm_2_fit)

## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda2) 
## 
## Coefficients:
##           ar1      ar2    ma1     ma2     sma1
##       -0.4247  -0.9074  0.468  0.8004  -0.7262
## s.e.   0.0587   0.0438  0.089  0.0584   0.0409
## 
## sigma^2 estimated as 69.93:  log likelihood=-1267.92
## AIC=2547.85   AICc=2548.09   BIC=2571.13

atm_2_fit %>%
  gg_tsresiduals()

atm_3_fit <- atm_3 %>%
  model(NAIVE(box_cox(Cash, lambda3)))

report(atm_3_fit)

## Series: Cash 
## Model: NAIVE 
## Transformation: box_cox(Cash, lambda3) 
## 
## sigma^2: 396.8997

atm_3_fit %>%
  gg_tsresiduals()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

atm_4_fit <- atm_4 %>%
  model(ARIMA(box_cox(Cash, lambda4)))

report(atm_4_fit)

## Series: Cash 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## Transformation: box_cox(Cash, lambda4) 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0761  0.2091  0.1986   17.3297
## s.e.  0.0529  0.0517  0.0525    0.7524
## 
## sigma^2 estimated as 187.5:  log likelihood=-1471.55
## AIC=2953.1   AICc=2953.27   BIC=2972.6

atm_4_fit %>%
  gg_tsresiduals()

All residuals, except for ATM 3 show have constant variability, seem to be white noise, and have approximately close to normal distributions.

Forecasts

For the forecasts we will save each of the ATMs in its own .csv file using write.csv:

forecast_atm_1 <- atm_1_fit %>% forecast(h=31)
forecast_atm_2 <- atm_2_fit %>% forecast(h=31)
forecast_atm_3 <- atm_3_fit %>% forecast(h=31)
forecast_atm_4 <- atm_4_fit %>% forecast(h=31)

write.csv(forecast_atm_1,"forecast_atm_1.csv")
write.csv(forecast_atm_2,"forecast_atm_2.csv")
write.csv(forecast_atm_3,"forecast_atm_3.csv")
write.csv(forecast_atm_4,"forecast_atm_4.csv")

Part B - Forecasting Power

Part B consisx of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

head(RCFL)

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

summary(RCFL)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

# Let's locate any missing values
which(is.na(RCFL), arr.ind=TRUE)

##      row col
## [1,] 129   3

# there are in roww 861 - 2008-Sept
slice(RCFL,c(127:132))

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          859 2008-Jul   7643987
## 2          860 2008-Aug   8037137
## 3          861 2008-Sep        NA
## 4          862 2008-Oct   5101803
## 5          863 2008-Nov   4555602
## 6          864 2008-Dec   6442746

RCFL_data <- RCFL %>% rename(Date = 'YYYY-MMM')
RCFL_data <- RCFL_data %>%  mutate(Date = as.Date(paste0('01-', Date), '%d-%Y-%b'))
min(RCFL_data$KWH,na.rm = TRUE)

## [1] 770523

RCFL_2 <-ts(RCFL[, "KWH"], start = c(1998, 1), frequency = 12)
ggseasonplot(RCFL_2)+ggtitle('USAGE BY YEAR FOR RESIDENTIAL POWER')

The data does seems to appear seasonal, so I believe we can use mean value of the months of June/November to handle the missing values.

RCFL_data <- RCFL_data[-c(129,151),]

#Get average by month
RCFL_data$Month <- months(RCFL_data$Date)
aggregate(KWH ~ Month, RCFL_data, mean)

##        Month     KWH
## 1      April 5299734
## 2     August 8298211
## 3   December 6283175
## 4   February 6946556
## 5    January 8007422
## 6       July 7852521
## 7       June 6536092
## 8      March 5971450
## 9        May 5039034
## 10  November 4953619
## 11   October 5657164
## 12 September 7702333

RCFL$KWH[is.na(RCFL$KWH)] = median(RCFL$KWH, na.rm=TRUE)

summary(RCFL)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5434539  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6501333  
##  3rd Qu.:876.2                      3rd Qu.: 7608792  
##  Max.   :924.0                      Max.   :10655730

RCFL_ts <- ts(RCFL$KWH, start=c(1998,1), frequency = 12)
RCFL_ts

##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1998  6862583  5838198  5420658  5010364  4665377  6467147  8914755  8607428
## 1999  7183759  5759262  4847656  5306592  4426794  5500901  7444416  7564391
## 2000  7068296  5876083  4807961  4873080  5050891  7092865  6862662  7517830
## 2001  7538529  6602448  5779180  4835210  4787904  6283324  7855129  8450717
## 2002  7099063  6413429  5839514  5371604  5439166  5850383  7039702  8058748
## 2003  7256079  6190517  6120626  4885643  5296096  6051571  6900676  8476499
## 2004  7584596  6560742  6526586  4831688  4878262  6421614  7307931  7309774
## 2005  8225477  6564338  5581725  5563071  4453983  5900212  8337998  7786659
## 2006  7793358  5914945  5819734  5255988  4740588  7052275  7945564  8241110
## 2007  8031295  7928337  6443170  4841979  4862847  5022647  6426220  7447146
## 2008  7964293  7597060  6085644  5352359  4608528  6548439  7643987  8037137
## 2009  8072330  6976800  5691452  5531616  5264439  5804433  7713260  8350517
## 2010  9397357  8390677  7347915  5776131  4919289  6696292   770523  7922701
## 2011  8394747  8898062  6356903  5685227  5506308  8037779 10093343 10308076
## 2012  8991267  7952204  6356961  5569828  5783598  7926956  8886851  9612423
## 2013 10655730  7681798  6517514  6105359  5940475  7920627  8415321  9080226
##           Sep      Oct      Nov      Dec
## 1998  6989888  6345620  4640410  4693479
## 1999  7899368  5358314  4436269  4419229
## 2000  8912169  5844352  5041769  6220334
## 2001  7112069  5242535  4461979  5240995
## 2002  8245227  5865014  4908979  5779958
## 2003  7791791  5344613  4913707  5756193
## 2004  6690366  5444948  4824940  5791208
## 2005  7057213  6694523  4313019  6181548
## 2006  7296355  5104799  4458429  6226214
## 2007  7666970  5785964  4907057  6047292
## 2008  6283324  5101803  4555602  6442746
## 2009  7583146  5566075  5339890  7089880
## 2010  7819472  5875917  4800733  6152583
## 2011  8943599  5603920  6154138  8273142
## 2012  7559148  5576852  5731899  6609694
## 2013  7968220  5759367  5769083  9606304

ggtsdisplay(RCFL_ts, main="Monthly Power Consumption Pre- Transformation")

The ACF plot does not indicate pure white noise, however the PACF shows more as there are much less spikes that pass the boundary, naming only in the initial two lags.

Box-Cox Transformation

RCFLS_BXCX <- RCFL_ts %>% BoxCox(lambda= 'auto')
ggtsdisplay(RCFLS_BXCX, main='MONTHLY POWER CONSUMER BXCX')

#Exploration through a seasonal plot
ggseasonplot(RCFLS_BXCX)

summary(RCFLS_BXCX)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.80   38.69   39.42   39.47   40.41   42.20

ggsubseriesplot(RCFLS_BXCX)

ggAcf(RCFLS_BXCX)

Using the box-test we can take a closer look:

Box.test(RCFLS_BXCX, type = c("Ljung-Box"))

## 
##  Box-Ljung test
## 
## data:  RCFLS_BXCX
## X-squared = 16.556, df = 1, p-value = 4.723e-05

summary(RCFLS_BXCX)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.80   38.69   39.42   39.47   40.41   42.20

boxplot(RCFLS_BXCX~cycle(RCFLS_BXCX))

Next, we will being differencing:

print(paste0("Suggested # of diff: ", ndiffs(RCFLS_BXCX)))

## [1] "Suggested # of diff: 1"

print(paste0("DIFF REQUIRED (SEASIONAL): ", ndiffs(diff(RCFLS_BXCX, lag=12))))

## [1] "DIFF REQUIRED (SEASIONAL): 0"

RCFL_PWR_DIFF <- RCFLS_BXCX %>% diff(lag = 12)
ggtsdisplay(RCFL_PWR_DIFF, main= "Monthly power consumption BXCX AND DIFF")

We can see that there is a particularly large spike at the 12th lag in both the ACF and PACF plots, however the PACF does have more spikes that past the boundary.

Using a ggseasonplot, we can view the residential power usage by year through a vissual:

ggseasonplot(RCFL_PWR_DIFF,polar = TRUE)+ggtitle('Residential Power Usage by Year')

plot(RCFL_PWR_DIFF)

We can also view the moving average:

autoplot(RCFL_PWR_DIFF, series="Data")+
  autolayer(ma(RCFL_PWR_DIFF, 12), series = "12 Month Moving Avg")+ ggtitle("2014 MVING AVG")

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

Forecasting with various methods

1. STL - ANN NO DP

#stlf - etsmodel
RCFLS_STL <- stlf(RCFL_PWR_DIFF, damped=FALSE, s.window = "periodic", robust=TRUE, h = 12)

# forecast plot
autoplot(RCFLS_STL) + autolayer(fitted(RCFLS_STL))

2. STL _ DP AADN

#stlf - etsmodel estimation --- M, Ad, N is chosen.
RCFL_STL_DP <- stlf(RCFL_PWR_DIFF, damped=TRUE, s.window = "periodic", robust=TRUE, h = 12)

# forecast plot
autoplot(RCFL_STL_DP) + autolayer(fitted(RCFL_STL_DP))

3. ARIMA

# auto.arima
arima_model <- auto.arima(RCFL_PWR_DIFF)

# forecast values
arima_model <- forecast(arima_model, h=20)

# forecast plot
autoplot(arima_model) + autolayer(fitted(arima_model))

4. ETS MNM

RCFL_ETS<- ets(RCFL_PWR_DIFF)

# forecast plot
autoplot(forecast(RCFL_ETS
                , h=12)) + autolayer(fitted(RCFL_ETS
                                                    ))

- Exponential Smoothing

RCFL_FCST_PWR_S <- ses(RCFL_PWR_DIFF, h=12)
autoplot(RCFL_FCST_PWR_S)+
  autolayer(fitted(RCFL_FCST_PWR_S), series="Fitted")

Now, let’s compare the models:

#STL - ANN NO DP
accuracy(RCFLS_STL)

##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0005455413 1.348221 0.5931651 -1360.432 1605.179 0.5404898
##                   ACF1
## Training set 0.1194029

checkresiduals(RCFLS_STL)

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 65.794, df = 24, p-value = 9.303e-06
## 
## Model df: 0.   Total lags used: 24

summary(RCFLS_STL)

## 
## Forecast method: STL +  ETS(A,N,N)
## 
## Model Information:
## ETS(A,N,N) 
## 
## Call:
## ets(y = na.interp(x), model = etsmodel, damped = FALSE, allow.multiplicative.trend = allow.multiplicative.trend)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0713 
## 
##   sigma:  1.3558
## 
##      AIC     AICc      BIC 
## 1048.295 1048.432 1057.874 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0005455413 1.348221 0.5931651 -1360.432 1605.179 0.5404898
##                   ACF1
## Training set 0.1194029
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.17305758 -1.564437 1.910552 -2.484211 2.830326
## Feb 2014     0.06130592 -1.676189 1.798801 -2.595963 2.718575
## Mar 2014     0.12236178 -1.615133 1.859857 -2.534907 2.779631
## Apr 2014     0.08210098 -1.655394 1.819596 -2.575168 2.739370
## May 2014     0.12746350 -1.610031 1.864958 -2.529805 2.784732
## Jun 2014     0.14679543 -1.590699 1.884290 -2.510474 2.804064
## Jul 2014    -0.19208644 -1.929581 1.545408 -2.849355 2.465183
## Aug 2014    -0.02245609 -1.759951 1.715039 -2.679725 2.634813
## Sep 2014     0.16034813 -1.577147 1.897843 -2.496921 2.817617
## Oct 2014     0.05745299 -1.680042 1.794948 -2.599816 2.714722
## Nov 2014     0.08070004 -1.656795 1.818195 -2.576569 2.737969
## Dec 2014     0.05898262 -1.678512 1.796477 -2.598286 2.716252

#STL - DP AADN
accuracy(RCFL_STL_DP)

##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set -0.007629573 1.349456 0.5947813 -1328.026 1577.668 0.5419625
##                  ACF1
## Training set 0.120852

checkresiduals(RCFL_STL_DP)

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,Ad,N)
## Q* = 65.735, df = 24, p-value = 9.492e-06
## 
## Model df: 0.   Total lags used: 24

summary(RCFL_STL_DP)

## 
## Forecast method: STL +  ETS(A,Ad,N)
## 
## Model Information:
## ETS(A,Ad,N) 
## 
## Call:
## ets(y = na.interp(x), model = etsmodel, damped = TRUE, allow.multiplicative.trend = allow.multiplicative.trend)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     beta  = 1e-04 
##     phi   = 0.9622 
## 
##   Initial states:
##     l = 0.1056 
##     b = -6e-04 
## 
##   sigma:  1.3686
## 
##      AIC     AICc      BIC 
## 1054.625 1055.110 1073.783 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set -0.007629573 1.349456 0.5947813 -1328.026 1577.668 0.5419625
##                  ACF1
## Training set 0.120852
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.17900301 -1.574926 1.932932 -2.503400 2.861406
## Feb 2014     0.06761419 -1.686315 1.821544 -2.614789 2.750018
## Mar 2014     0.12901915 -1.624910 1.882949 -2.553385 2.811423
## Apr 2014     0.08909425 -1.664835 1.843024 -2.593310 2.771498
## May 2014     0.13477996 -1.619150 1.888710 -2.547624 2.817184
## Jun 2014     0.15442285 -1.599507 1.908353 -2.527982 2.836828
## Jul 2014    -0.18415983 -1.938091 1.569771 -2.866565 2.498246
## Aug 2014    -0.01424159 -1.768173 1.739690 -2.696648 2.668165
## Sep 2014     0.16883961 -1.585092 1.922772 -2.513568 2.851247
## Oct 2014     0.06621097 -1.687722 1.820144 -2.616198 2.748620
## Nov 2014     0.08971444 -1.664219 1.843648 -2.592696 2.772125
## Dec 2014     0.06824374 -1.685691 1.822179 -2.614168 2.750656

#ARIMA
accuracy(arima_model)

##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02554754 0.9666583 0.4541852 -1920.696 2160.894 0.4138519
##                     ACF1
## Training set 0.006469743

checkresiduals(arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
## Q* = 6.8185, df = 21, p-value = 0.9985
## 
## Model df: 3.   Total lags used: 24

summary(arima_model)

## 
## Forecast method: ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
## 
## Model Information:
## Series: RCFL_PWR_DIFF 
## ARIMA(0,0,1)(0,0,2)[12] with non-zero mean 
## 
## Coefficients:
##          ma1     sma1    sma2    mean
##       0.1195  -0.9264  0.0890  0.0616
## s.e.  0.0712   0.0810  0.0811  0.0198
## 
## sigma^2 = 0.9557:  log likelihood = -257.11
## AIC=524.23   AICc=524.57   BIC=540.19
## 
## Error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02554754 0.9666583 0.4541852 -1920.696 2160.894 0.4138519
##                     ACF1
## Training set 0.006469743
## 
## Forecasts:
##          Point Forecast     Lo 80      Hi 80     Lo 95     Hi 95
## Jan 2014    -0.71170867 -1.964897  0.5414796 -2.628295 1.2048774
## Feb 2014     0.20176196 -1.060338  1.4638615 -1.728453 2.1319766
## Mar 2014     0.10559912 -1.156500  1.3676986 -1.824616 2.0358137
## Apr 2014    -0.21986269 -1.481962  1.0422368 -2.150077 1.7103519
## May 2014    -0.33880539 -1.600905  0.9232941 -2.269020 1.5914092
## Jun 2014    -0.43704168 -1.699141  0.8250578 -2.367256 1.4931729
## Jul 2014    -1.07257955 -2.334679  0.1895200 -3.002794 0.8576351
## Aug 2014     0.06430436 -1.197795  1.3264039 -1.865910 1.9945190
## Sep 2014     0.14938779 -1.112712  1.4114873 -1.780827 2.0796024
## Oct 2014     0.22911025 -1.032989  1.4912097 -1.701104 2.1593249
## Nov 2014    -0.19797910 -1.460079  1.0641204 -2.128194 1.7322355
## Dec 2014    -1.54532103 -2.807419 -0.2832232 -3.475533 0.3848910
## Jan 2015    -0.02932622 -1.743669  1.6850167 -2.651187 2.5925349
## Feb 2015     0.05006537 -1.669873  1.7700036 -2.580353 2.6804839
## Mar 2015     0.05625769 -1.663681  1.7761959 -2.574161 2.6866762
## Apr 2015     0.08773329 -1.632205  1.8076715 -2.542685 2.7181518
## May 2015     0.10404245 -1.615896  1.8239807 -2.526376 2.7344610
## Jun 2015     0.11673378 -1.603204  1.8366720 -2.513685 2.7471523
## Jul 2015     0.18863975 -1.531298  1.9085780 -2.441779 2.8190583
## Aug 2015     0.06605819 -1.653880  1.7859964 -2.564360 2.6964767

#ETS MNM
accuracy(RCFL_ETS)

##                         ME     RMSE       MAE       MPE     MAPE     MASE
## Training set -1.229798e-05 1.346981 0.6002461 -768.1263 994.3076 0.546942
##                   ACF1
## Training set 0.1202904

checkresiduals(RCFL_ETS)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 66.674, df = 24, p-value = 6.893e-06
## 
## Model df: 0.   Total lags used: 24

summary(RCFL_ETS)

## ETS(A,N,N) 
## 
## Call:
## ets(y = RCFL_PWR_DIFF)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0719 
## 
##   sigma:  1.3545
## 
##      AIC     AICc      BIC 
## 1047.964 1048.100 1057.543 
## 
## Training set error measures:
##                         ME     RMSE       MAE       MPE     MAPE     MASE
## Training set -1.229798e-05 1.346981 0.6002461 -768.1263 994.3076 0.546942
##                   ACF1
## Training set 0.1202904

#EXP Smoothing
accuracy(RCFL_FCST_PWR_S)

##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0002402688 1.346981 0.6002376 -765.1024 991.1566 0.5469343
##                   ACF1
## Training set 0.1202903

checkresiduals(RCFL_FCST_PWR_S)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 66.674, df = 24, p-value = 6.893e-06
## 
## Model df: 0.   Total lags used: 24

summary(RCFL_FCST_PWR_S)

## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
## ses(y = RCFL_PWR_DIFF, h = 12)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0716 
## 
##   sigma:  1.3545
## 
##      AIC     AICc      BIC 
## 1047.964 1048.100 1057.543 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0002402688 1.346981 0.6002376 -765.1024 991.1566 0.5469343
##                   ACF1
## Training set 0.1202903
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Feb 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Mar 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Apr 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## May 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Jun 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Jul 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Aug 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Sep 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Oct 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726455
## Nov 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726455
## Dec 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726455

From looking at all five models, based on the AIC it seems the ARIMA model performed best. While AIC dropped to525, BIC dropped to 540 as well. RMSE also dropped from 1.347 to 0.966. Since we have chosen the ARIMA model s our pick, we wil finally predict the values in a csv file.

final_results <- forecast(arima_model, h=12)
rslts <- data.frame(final_results)

write.csv(rslts, "results_rcfl.csv", row.names = FALSE)

Project_1_Data_624

Natalie Kalukeerthie

2024-10-15