DATA 624 Project 1

Part A | ATM Forecast, ATM624Data.xlsx

In part A, forecast how much cash is taken out of 4 different ATM machines for May 2010.
The data is given in a single file. The variable Cash is provided in hundreds of dollars, other than that it is straight forward. Explain and demonstrate your process, techniques used and not used, and your actual forecast. Data is given via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

Loading the Data

Date was transformed as it was a 5-digit numeric type that can be converted since 00000 would 1/1/1900. The data was then transformed into a tsibble object. The dates were filtered to select data from May 2009 to April 2010, as the last 14 rows for May 2010 are all NA values. There were 19 missing values in the Cash column prior to filtering the dates and now there are only 5 NA values.

# reading in excel file
ATM <- read_excel("ATM624DATA.xlsx", col_types = c('date', 'text', 'numeric')) %>%
  # converting into date format
  mutate(DATE = as_date(DATE)) %>%
  # renaming column name
  rename(Date = DATE) %>%
  # converting to tsibble
  as_tsibble(index = Date, key = ATM) %>%
  # selecting from May 2009 to April 2010
  filter(Date < "2010-05-01")

head(ATM)

## # A tsibble: 6 x 3 [1D]
## # Key:       ATM [1]
##   Date       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM1     96
## 2 2009-05-02 ATM1     82
## 3 2009-05-03 ATM1     85
## 4 2009-05-04 ATM1     90
## 5 2009-05-05 ATM1     99
## 6 2009-05-06 ATM1     88

summary(ATM)

##       Date                ATM                 Cash        
##  Min.   :2009-05-01   Length:1460        Min.   :    0.0  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:    0.5  
##  Median :2009-10-30   Mode  :character   Median :   73.0  
##  Mean   :2009-10-30                      Mean   :  155.6  
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
##  Max.   :2010-04-30                      Max.   :10919.8  
##                                          NA's   :5

Handling the Missing Data

There are 3 missing values for ATM1 and 2 missing values for ATM2. As mentioned in Section 13.9, ARIMA models will work in handling missing values. To fix those 5 missing values, an ARIMA model was fitted on the data containing missing values and then used to interpolate the missing observations.

# summarize the missing data by ATM
ATM %>% 
  as.data.frame() %>%
  group_by(ATM) %>%
  summarise(`Missing Values` = sum(is.na(Cash)))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4 x 2
##   ATM   `Missing Values`
##   <chr>            <int>
## 1 ATM1                 3
## 2 ATM2                 2
## 3 ATM3                 0
## 4 ATM4                 0

ATM <- ATM %>%
  # Fit ARIMA model to the data containing missing values
  model(ARIMA(Cash)) %>%
  # Estimate Cash for the missing values.
  interpolate(ATM)

# check to see if any missing values again
ATM %>% 
  as.data.frame() %>%
  group_by(ATM) %>%
  summarise(`Missing Values` = sum(is.na(Cash)))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4 x 2
##   ATM   `Missing Values`
##   <chr>            <int>
## 1 ATM1                 0
## 2 ATM2                 0
## 3 ATM3                 0
## 4 ATM4                 0

Exporing the Data

At first glance, it appears that ATM1 and ATM2 have some seasonality to it while ATM3 seems to have only been active at the end of April 2010. There also seems to be a large outlier in ATM4. It would be best to remove this outlier and explore the data further.

ATM %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = " ATM Before Outlier Removal")

Individual ATM Exploring and Modeling

ATM 1

There is no trend for ATM1 and there is a weekly seasonality. The seasonality shows that there is a sudden drop in amount of cash withdrawn once a week. The remainder seems random up until March 2010, where there is a pattern in the increased variation.

#plot
ATM %>%
  filter(ATM == "ATM1") %>%
  autoplot(Cash) +
  ggtitle("Non-tranformed ATM1")

# STL decomposition
ATM %>%
  filter(ATM == "ATM1") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM1")

# lambda for box cox transformation
lambda <- ATM %>%
  filter(ATM == "ATM1") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm1_fit <- ATM %>%
  filter(ATM == "ATM1") %>%
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # transformed additive ETS model
    additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
    # transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
    # transformed SNAIVE model
    snaive_ts = SNAIVE(box_cox(Cash,lambda)),
    # arima model
    ARIMA = ARIMA(box_cox(Cash,lambda))
  )

# stats for the models
left_join(glance(atm1_fit) %>% select(.model:BIC), 
          accuracy(atm1_fit) %>% select(.model, RMSE)) %>%
  arrange(RMSE)

## # A tibble: 7 x 7
##   .model              sigma2 log_lik   AIC  AICc   BIC  RMSE
##   <chr>                <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 multiplicative_ts   0.0455  -1342. 2704. 2704. 2743.  23.0
## 2 additive          577.      -2232. 4485. 4486. 4524.  23.7
## 3 ARIMA               3.41     -728. 1464. 1464. 1480.  24.5
## 4 additive_ts         3.51    -1301. 2623. 2624. 2662.  24.6
## 5 multiplicative      0.134   -2273. 4566. 4566. 4605.  26.2
## 6 snaive            765.         NA    NA    NA    NA   27.6
## 7 snaive_ts           4.76       NA    NA    NA    NA   27.6

When it comes to choosing the best model, RMSE, AIC, AICc, or BIC can be used as the comparative statistic. The transformed multiplicative ETS model has the lowest RMSE but the ARIMA model has the lowest AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.

# report of the arima model
atm1_fit %>% select(.model = "ARIMA") %>% report()

## Series: Cash 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda) 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1249  -0.1046  -0.6359
## s.e.  0.0523   0.0516   0.0441
## 
## sigma^2 estimated as 3.414:  log likelihood=-728.12
## AIC=1464.23   AICc=1464.34   BIC=1479.75

\(ARIMA(0,0,2)(0,1,1)_7\) was modeled on ATM1.

# forecasting
fc_atm1 <- atm1_fit %>%
  forecast(h = 31) %>%
  filter(.model=='ARIMA')

# forcasted plot
fc_atm1 %>%
  autoplot(ATM) +
  ggtitle(TeX(paste0("ATM 1 Forcasted with $ARIMA(0,0,2)(0,1,1)_7$ and $\\lambda$ = ",
         round(lambda,2))))

# residuals
atm1_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle(TeX(paste0("Residuals for $ARIMA(0,0,2)(0,1,1)_7$ with $\\lambda$ = ",
         round(lambda,2))))

#Box-Pierce test, ℓ=2m for seasonal data, m=7
atm1_fit %>% 
  select(.model = "ARIMA") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)

## # A tibble: 1 x 4
##   .model bp_stat bp_pvalue .name_repair
##   <chr>    <dbl>     <dbl> <chr>       
## 1 .model    10.4     0.730 minimal

The residuals seem to be white noise with a few data points in the left tail.

ATM 2

There is a weekly seasonality. The seasonality shows that there is a sudden drop followed by a sharp increase in amount of cash withdrawn once a week. The remainder seems random up until February/March 2010, where there is an increased variation. The data seems to have a slightly decreasing trend with a sudden increase and decrease around February/March 2010.

#plot
ATM %>%
  filter(ATM == "ATM2") %>%
  autoplot(Cash) +
  ggtitle("Non-tranformed ATM2")

# STL decomposition
ATM %>%
  filter(ATM == "ATM2") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM2")

# lambda for box cox transformation
lambda <- ATM %>%
  filter(ATM == "ATM2") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm2_fit <- ATM %>%
  filter(ATM == "ATM2") %>%
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # transformed additive ETS model
    additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
    # transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
    # transformed SNAIVE model
    snaive_ts = SNAIVE(box_cox(Cash,lambda)),
    # arima model
    ARIMA = ARIMA(box_cox(Cash,lambda))
  )

# stats for the models
left_join(glance(atm2_fit) %>% select(.model:BIC), 
          accuracy(atm2_fit) %>% select(.model, RMSE)) %>%
  arrange(RMSE)

## # A tibble: 7 x 7
##   .model             sigma2 log_lik   AIC  AICc   BIC  RMSE
##   <chr>               <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA              33.5    -1136. 2284. 2284. 2307.  24.3
## 2 additive          622.     -2246. 4512. 4513. 4551.  24.6
## 3 additive_ts        35.7    -1725. 3469. 3470. 3508.  25.3
## 4 snaive_ts          48.4       NA    NA    NA    NA   29.6
## 5 snaive            881.        NA    NA    NA    NA   29.6
## 6 multiplicative_ts   0.270  -1891. 3801. 3802. 3840.  33.6
## 7 multiplicative      0.738  -2438. 4897. 4898. 4936.  76.5

When it comes to choosing the best model, RMSE, AIC, AICc, or BIC can be used as the comparative statistic. The ARIMA model has the lowest RMSE, AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.

# report of the arima model
atm2_fit %>% select(.model = "ARIMA") %>% report()

## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4174  -0.8997  0.4575  0.7879  -0.7083
## s.e.   0.0599   0.0485  0.0893  0.0628   0.0423
## 
## sigma^2 estimated as 33.5:  log likelihood=-1136.02
## AIC=2284.04   AICc=2284.28   BIC=2307.32

\((2,0,2)(0,1,1)_7\) was modeled on ATM2.

# forecasting
fc_atm2 <- atm2_fit %>%
  forecast(h = 31) %>%
  filter(.model=='ARIMA')

# forcasted plot
fc_atm2 %>%
  autoplot(ATM) +
  ggtitle(TeX(paste0("ATM 2 Forcasted with $ARIMA(2,0,2)(0,1,1)_7$ and $\\lambda$ = ",
         round(lambda,2))))

# residuals
atm2_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle(TeX(paste0("Residuals for $ARIMA(2,0,2)(0,1,1)_7$ with $\\lambda$ = ",
         round(lambda,2))))

#Box-Pierce test, ℓ=2m for seasonal data, m=7
atm2_fit %>% 
  select(.model = "ARIMA") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)

## # A tibble: 1 x 4
##   .model bp_stat bp_pvalue .name_repair
##   <chr>    <dbl>     <dbl> <chr>       
## 1 .model    9.95     0.766 minimal

The residuals seem to be white noise with a few residuals in both tails.

ATM 3

Unlike the other ATMs, ATM3 only has three data points which would not be enough to forecast an entire month, nor find an trend or seasonality. It would be best to use the mean of these three days to forecast the month of May.

#plot
ATM %>%
  filter(ATM == "ATM3") %>%
  autoplot(Cash) +
  ggtitle("Non-tranformed ATM3")

# mean model
atm3_fit <- ATM %>%
  filter(ATM == "ATM3",
         Cash != 0) %>%
  model(MEAN(Cash))

# forecasting
fc_atm3 <- atm3_fit %>%
  forecast(h = 31) 

# forcasted plot
fc_atm3 %>%
  autoplot(ATM) +
  ggtitle("ATM 3 Forecasted with the MEAN() Model")

ATM 4

There is a clear outlier in ATM4 and it would be best to remove it in order to forecast the data. We then can then interpolate it for imputation. If we define outliers as those that are greater than 3 interquartile ranges, there are 2 outliers that are identified: September 22 and February 9.

#plot
ATM %>%
  filter(ATM == "ATM4") %>%
  autoplot(Cash) +
  ggtitle("Non-tranformed ATM4")

# STL decomposition
dcmp_4 <- ATM %>%
  filter(ATM == "ATM4") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components()

dcmp_4 %>%  
  autoplot() +
  labs(title = "STL Decomposition for ATM4")

# identifying the outliers
outliers <- dcmp_4 %>%
  filter(remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
           remainder > quantile(remainder, 0.75) + 3*IQR(remainder))
outliers

## # A dable: 2 x 8 [1D]
## # Key:     ATM, .model [1]
## # :        Cash = trend + season_week + remainder
##   ATM   .model       Date         Cash trend season_week remainder season_adjust
##   <chr> <chr>        <date>      <dbl> <dbl>       <dbl>     <dbl>         <dbl>
## 1 ATM4  "STL(Cash ~~ 2009-09-22  1712.  373.       -71.5     1411.         1784.
## 2 ATM4  "STL(Cash ~~ 2010-02-09 10920.  279.       -71.5    10712.        10991.

ATM_miss <- ATM %>%
  #replace outliers
  mutate(Cash = replace(Cash, ATM == "ATM4" & Date == "2010-02-09", NA)) 
  
ATM <- ATM_miss %>%
  # Fit ARIMA model to the data with missing values
  model(ARIMA(Cash)) %>%
  # Estimate Cash for the missing values / outliers
  interpolate(ATM_miss)

There is a weekly seasonality. The seasonality shows that there is a sudden drop followed by a sharp increase in amount of cash withdrawn once a week. However, the gray bar is large in the seasonality, meaning the variation is the smallest in the seasonality compared to the variation in the data. The remainder seems random for most of the data and it picks up the outlier on September 22, which was not removed.

#plot
ATM %>%
  filter(ATM == "ATM4") %>%
  autoplot(Cash) +
  ggtitle("ATM4 with No Outlier")

# STL decomposition
ATM %>%
  filter(ATM == "ATM4") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM4 with No Outlier")

# lambda for box cox transformation
lambda <- ATM %>%
  filter(ATM == "ATM4") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm4_fit <- ATM %>%
  filter(ATM == "ATM4") %>%
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # transformed additive ETS model
    additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
    # transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
    # transformed SNAIVE model
    snaive_ts = SNAIVE(box_cox(Cash,lambda)),
    # arima model
    ARIMA = ARIMA(box_cox(Cash,lambda)),
     # transformed additive ETS model, no seasonality
    additive_ts_no_s = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("N")),
    # transformed multiplicative ETS model, no seasonality
    multiplicative_ts_no_s = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("N"))
  )

# stats for the models
left_join(glance(atm4_fit) %>% select(.model:BIC), 
          accuracy(atm4_fit) %>% select(.model, RMSE)) %>%
  arrange(RMSE)

## # A tibble: 9 x 7
##   .model                     sigma2 log_lik   AIC  AICc   BIC  RMSE
##   <chr>                       <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additive               110883.     -3192. 6404. 6405. 6443.  329.
## 2 additive_ts                95.3    -1904. 3828. 3828. 3867.  340.
## 3 multiplicative_ts           0.192  -1900. 3820. 3821. 3859.  346.
## 4 multiplicative              0.728  -3192. 6404. 6405. 6443.  354.
## 5 ARIMA                     101.     -1358. 2726. 2726. 2746.  354.
## 6 multiplicative_ts_no_s      0.211  -1938. 3883. 3883. 3894.  366.
## 7 additive_ts_no_s          113.     -1938. 3883. 3883. 3894.  366.
## 8 snaive                 205867.        NA    NA    NA    NA   453.
## 9 snaive_ts                 164.        NA    NA    NA    NA   453.

The ETS() models with no seasonality ranked worse than those with seasonality. The Additive ETS() models has the highest RMSE but the other statistics are worse compared to models with transformed data. Yet again, the ARIMA model has the lowest AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.

# report of the arima model
atm4_fit %>% select(.model = "ARIMA") %>% report()

## Series: Cash 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## Transformation: box_cox(Cash, lambda) 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0784  0.2131  0.2086   13.3387
## s.e.  0.0528  0.0516  0.0525    0.5518
## 
## sigma^2 estimated as 100.6:  log likelihood=-1358.09
## AIC=2726.18   AICc=2726.34   BIC=2745.68

\((0,0,1)(2,0,0)_7\) was modeled on ATM4.

# forecasting
fc_atm4 <- atm4_fit %>%
  forecast(h = 31) %>%
  filter(.model=='ARIMA')

# forcasted plot
fc_atm4 %>%
  autoplot(ATM) +
  ggtitle(TeX(paste0("ATM 4 Forcasted with $ARIMA(0,0,1)(2,0,0)_7$ and $\\lambda$ = ",
         round(lambda,2))))

# residuals
atm4_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle(TeX(paste0("Residuals for $ARIMA(0,0,1)(2,0,0)_7$ with $\\lambda$ = ",
         round(lambda,2))))

#Box-Pierce test, ℓ=2m for seasonal data, m=7
atm2_fit %>% 
  select(.model = "ARIMA") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)

## # A tibble: 1 x 4
##   .model bp_stat bp_pvalue .name_repair
##   <chr>    <dbl>     <dbl> <chr>       
## 1 .model    9.95     0.766 minimal

The residuals seem to be white noise with a few residuals in both tails. The forecasts may not seem like they model well as it can be seen that they taper off to the mean. However, the prediction levels may capture the true values for May 2010.

Forecasted Data

Here are the final forecasts on all four ATMs. ATM1 and ATM2 keep up with the seasonality and its variation. ATM3 is just the average of the non-zero data. Lastly, ATM4 seems to diminish to the mean the further into May we go.

# save as data frame
fc <- bind_rows(fc_atm1, fc_atm2, fc_atm3, fc_atm4) %>%
  as.data.frame() %>%
  select(Date, ATM, .mean) %>%
  rename(Cash = .mean)

# export file
fc %>% write.csv("ATM_forecasts.csv")

# plot of may 2010
fc %>%
  as_tsibble(index = Date, key = ATM) %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "Forecasted ATM Withdrawls in May 2010")

# altogether plot
fc %>%
  as_tsibble(index = Date, key = ATM) %>%
  autoplot(Cash) +
  autolayer(ATM, Cash, colour = "black") +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawls")

Part B | Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Loading the Data

Month was converted from text to a a monthly time object. The data was then transformed into a tsibble object.

# reading in excel file
Res_Load <- read_excel("ResidentialCustomerForecastLoad-624.xlsx") %>%
  # renaming column name
  rename(Month = 'YYYY-MMM') %>%
  # converting into date format
  mutate(Month = yearmonth(Month)) %>%
  # converting to tsibble
  as_tsibble(index = Month) 

head(Res_Load)

## # A tsibble: 6 x 3 [1M]
##   CaseSequence    Month     KWH
##          <dbl>    <mth>   <dbl>
## 1          733 1998 Jan 6862583
## 2          734 1998 Feb 5838198
## 3          735 1998 Mar 5420658
## 4          736 1998 Apr 5010364
## 5          737 1998 May 4665377
## 6          738 1998 Jun 6467147

Handling the Missing Data

There is 1 missing values in the KWH column. As mentioned in Section 13.9, ARIMA models will work in handling missing values. To fix the missing value, an ARIMA model was fitted on the data containing the missing value and then used to interpolate the missing observation.

summary(Res_Load$KWH)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   770523  5429912  6283324  6502475  7620524 10655730        1

Res_Load <- Res_Load %>%
  # Fit ARIMA model to the data containing missing values
  model(ARIMA(KWH)) %>%
  # Estimate Cash for the missing values.
  interpolate(Res_Load)

# check to see if any missing values again
Res_Load %>% 
  as.data.frame() %>%
  summarise(`Missing Values` = sum(is.na(KWH)))

##   Missing Values
## 1              0

Exploring the Data

At first glance, it seems that the data is seasonal and has a slight increasing trend. There is also an outlier in July 2010 with a value of 770523. This is most likely due to an error in the data and is missing an additional digit. To fix the outlier, it will be best to remove it and interpolate it.

Res_Load %>%
  autoplot(KWH) +
  labs(title = "Residential Power Usage")

Res_miss <- Res_Load %>%
  #replace outliers with NA
  mutate(KWH = replace(KWH, KWH == 770523, NA)) 
  
Res_Load <- Res_miss %>%
  # Fit ARIMA model to the data with missing values
  model(ARIMA(KWH)) %>%
  # Estimate Cash for the missing values / outliers
  interpolate(Res_miss)

With the outlier removed, the graph looks more readable and sensible. There appears to be an increasing trend which accounts for the smallest variation in the data compared to the others. There is an annual seasonality, with peaks occurring in the summer and winter. The remained looks random, with a spike in Dec 2013.

#plot
Res_Load %>%
  autoplot(KWH) +
  ggtitle("Residential Power Usage with no Outlier")

# STL decomposition
Res_Load %>%
  model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition with no Outlier")

The seasonal plot further shows the increase in power every summer and winter. There is a clear increasing trend in the power usage during the peak months. The off-peak months probably vary every year due to the weather.

Res_Load %>%
  gg_season(KWH, labels = "both") +
  labs(title = "Seasonal plot: Residential Power Usage")

Res_Load %>%
  gg_subseries(KWH) +
  labs(title = "Residential Power Usage")

Modeling and Forecasting

A box-cox transformation was applied to some of the models and compared to other models where there was no transformation.

# lambda for box cox transformation
lambda <- Res_Load %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

res_fit <- Res_Load %>%
  model(
    # additive ETS model
    additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M")),
    # additive damped model
    damped = ETS(KWH ~ error("A") + trend("Ad") + season("A")),
    # SNAIVE model
    snaive = SNAIVE(KWH),
    # transformed additive ETS model
    additive_bc = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") + season("A")),
    # transformed multiplicative ETS model
    multiplicative_bc = ETS(box_cox(KWH,lambda) ~ error("M") + trend("A") + season("M")),
    # transformed additive damped model
    damped_bc = ETS(box_cox(KWH,lambda) ~ error("A") + trend("Ad") + season("A")),
    # transformed SNAIVE model
    snaive_bc = SNAIVE(box_cox(KWH,lambda)),
    # arima model
    ARIMA = ARIMA(box_cox(KWH,lambda))
  )

# stats for the models
left_join(glance(res_fit) %>% select(.model:BIC), 
          accuracy(res_fit) %>% select(.model, RMSE)) %>%
  arrange(AICc)

## # A tibble: 9 x 7
##   .model              sigma2 log_lik    AIC   AICc    BIC    RMSE
##   <chr>                <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1 ARIMA             1.51e- 5    752. -1493. -1493. -1478. 627673.
## 2 additive_bc       1.40e- 5    577. -1120. -1116. -1064. 610956.
## 3 damped_bc         1.41e- 5    576. -1116. -1112. -1058. 618831.
## 4 multiplicative_bc 6.51e- 7    575. -1115. -1112. -1060. 625349.
## 5 multiplicative    9.00e- 3  -3053.  6140.  6144.  6196. 613581.
## 6 additive          4.19e+11  -3065.  6165.  6168.  6220. 619703.
## 7 damped            4.25e+11  -3066.  6169.  6173.  6227. 622538.
## 8 snaive            6.51e+11     NA     NA     NA     NA  809926.
## 9 snaive_bc         2.25e- 5     NA     NA     NA     NA  809926.

Unlike the other data, this one produced negative values for AICc, AIC, and BIC. The additive exponential smoothing with a box cox transformation had the lowest RMSE, but the ARIMA had the best AIC, AICc, and BIC. Therefore, the ARIMA model is the best model to choose.

# report of the arima model
res_fit %>% select(.model = "ARIMA") %>% report()

## Series: KWH 
## Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift 
## Transformation: box_cox(KWH, lambda) 
## 
## Coefficients:
##          ma1     sar1     sar2  constant
##       0.2462  -0.7202  -0.3546    0.0013
## s.e.  0.0833   0.0751   0.0769    0.0004
## 
## sigma^2 estimated as 1.505e-05:  log likelihood=751.74
## AIC=-1493.47   AICc=-1493.13   BIC=-1477.51

\(ARIMA(0,0,1)(2,1,0)_{12}\) with drift was fitted on KWH.

# forecasting
fc_res <- res_fit %>%
  forecast(h = 12) %>%
  filter(.model=='ARIMA')

# forcasted plot
fc_res %>%
  autoplot(Res_Load) +
  ggtitle(TeX(paste0("ATM 1 Forcasted with $(0,0,1)(2,1,0)_{12}$ and $\\lambda$ = ",
         round(lambda,2))))

# residuals
res_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals(lag = 24) +
  ggtitle(TeX(paste0("Residuals for $(0,0,1)(2,1,0)_{12}$ with $\\lambda$ = ",
         round(lambda,2))))

#Box-Pierce test, ℓ=2m for seasonal data, m=12
res_fit %>% 
  select(.model = "ARIMA") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 24, dof = 0)

## # A tibble: 1 x 4
##   .model bp_stat bp_pvalue .name_repair
##   <chr>    <dbl>     <dbl> <chr>       
## 1 .model    29.9     0.187 minimal

The residuals seem to be white noise as shown by the graphs and the Box-Pierce test. It is interesting to note that the first few residuals seem to be constant.

Here are the final forecasted data using the ARIMA model:

fc_res <- fc_res %>%
  as.data.frame() %>%
  select(Month, .mean) %>%
  rename(KWH = .mean) %>%
  # adding back the case sequence
  mutate(CaseSequence = 925:936) %>%
  # rearranging order
  relocate(CaseSequence)
fc_res

##    CaseSequence    Month      KWH
## 1           925 2014 Jan 10297282
## 2           926 2014 Feb  8531331
## 3           927 2014 Mar  6650696
## 4           928 2014 Apr  5974089
## 5           929 2014 May  5942940
## 6           930 2014 Jun  8291370
## 7           931 2014 Jul  9539978
## 8           932 2014 Aug 10113059
## 9           933 2014 Sep  8474487
## 10          934 2014 Oct  5849784
## 11          935 2014 Nov  6112937
## 12          936 2014 Dec  8253088

# export file
fc_res %>% write.csv("Res_forecasts.csv")

Part C | BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Exploring and Forecasting Data

Pipe 1

Date Time was transformed as it was a 5-digit numeric type that can be converted since 00000 would 1/1/1900. Pipe1 has multiple recordings within an hour, so the average of each hour is taken. The data was then transformed into a tsibble object.

# reading in excel file
temp <- read_excel("Waterflow_Pipe1.xlsx", col_types = c('date', 'numeric')) %>%
  # converting into date format
  mutate(`Date Time` = as_datetime(`Date Time`)) %>%
  # renaming column name
  rename(DateTime = `Date Time`) %>%
  #separate the date and hour
  mutate(date = as.Date(DateTime),
         # get the hour
         hour = paste(format(DateTime, format = "%H"),":00:00"))

Pipe1 <- temp %>%
  # replace the date time with the rounded hours
  mutate(DateTime = ymd(date) + hms(hour)) %>%
  # grouping
  group_by(DateTime) %>%
  # taking average for each hour
  mutate(WaterFlow = mean(WaterFlow)) %>%
  # deleting duplicate rows
  distinct(DateTime, WaterFlow) %>%
  # converting to tsibble
  as_tsibble(index = DateTime)
 
head(Pipe1)

## # A tsibble: 6 x 2 [1h] <UTC>
## # Groups:    @ DateTime [6]
##   DateTime            WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-23 00:00:00      26.1
## 2 2015-10-23 01:00:00      18.9
## 3 2015-10-23 02:00:00      15.2
## 4 2015-10-23 03:00:00      23.1
## 5 2015-10-23 04:00:00      15.5
## 6 2015-10-23 05:00:00      22.7

summary(Pipe1)

##     DateTime                     WaterFlow     
##  Min.   :2015-10-23 00:00:00   Min.   : 8.923  
##  1st Qu.:2015-10-25 10:45:00   1st Qu.:17.033  
##  Median :2015-10-27 22:30:00   Median :19.784  
##  Mean   :2015-10-27 22:38:53   Mean   :19.893  
##  3rd Qu.:2015-10-30 10:15:00   3rd Qu.:22.789  
##  Max.   :2015-11-01 23:00:00   Max.   :31.730

There appears to be missing data for 4 hours in Pipe1. To fix this, we would have to interpolate the data using imputeTS. The interpolation method from Parts A and B did not work here, so a diferent method was used.

# find the missing hours and fill in NA
Pipe1 <- fill_gaps(Pipe1)

# identifies the hours
miss <- Pipe1 %>%
  filter(is.na(WaterFlow))
miss

## # A tsibble: 4 x 2 [1h] <UTC>
## # Groups:    @ DateTime [4]
##   DateTime            WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-27 17:00:00        NA
## 2 2015-10-28 00:00:00        NA
## 3 2015-11-01 05:00:00        NA
## 4 2015-11-01 09:00:00        NA

# interpolates the missing variables
temp <- temp %>%
  select(DateTime, WaterFlow) %>%
  rbind(.,miss) %>%
  as_tsibble(index = DateTime) %>%
  na_interpolation(.)

# combining the data, and replacing the NA values
Pipe1 <-left_join(Pipe1, temp, by = "DateTime") %>%
  mutate(WaterFlow  = coalesce(WaterFlow.x, WaterFlow.y)) %>%
  select(DateTime, WaterFlow)

The data is stationary as we fail to reject the null hypothesis.

Pipe1 %>%
  gg_tsdisplay(WaterFlow, plot_type='partial') +
  labs(title = "Water Flow of Pipe 1")

Pipe1 %>%
  features(WaterFlow, unitroot_kpss)

## # A tibble: 1 x 3
##   kpss_stat kpss_pvalue .name_repair
##       <dbl>       <dbl> <chr>       
## 1     0.269         0.1 minimal

Looking at the STL decomposition, there seems to be a daily seasonality.

Pipe1 %>%
  model(STL(WaterFlow ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition")

The data is transformed using box-cox. The ARIMA model with the box-cox transformation has lower statistics, meaning it is the better model.

# lambda for box cox transformation
lambda <- Pipe1 %>%
  features(WaterFlow, features = guerrero) %>%
  pull(lambda_guerrero)

p1_fit <- Pipe1 %>%
  model(
    # arima model
    ARIMA_bc = ARIMA(box_cox(WaterFlow,lambda)),
    ARIMA = ARIMA(WaterFlow)
  )

glance(p1_fit) %>% arrange(AICc) %>% select(.model:BIC)

## # A tibble: 2 x 6
##   .model   sigma2 log_lik   AIC  AICc   BIC
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA      18.2   -688. 1380. 1380. 1387.
## 2 ARIMA_bc   57.1   -825. 1655. 1655. 1662.

p1_fit %>% select(.model = "ARIMA_bc") %>% report()

## Series: WaterFlow 
## Model: ARIMA(0,0,0) w/ mean 
## Transformation: box_cox(WaterFlow, lambda) 
## 
## Coefficients:
##       constant
##        29.0075
## s.e.    0.4866
## 
## sigma^2 estimated as 57.07:  log likelihood=-825.36
## AIC=1654.71   AICc=1654.76   BIC=1661.67

The ARIMA model is just the mean of the data. The residuals appear to be white noise.

# forecasting the data
p1_fc <-p1_fit %>%
  forecast(h = 168) %>%
  filter(.model=='ARIMA_bc')

# forecasted plot
p1_fc %>%
  autoplot(Pipe1) +
  ggtitle(TeX(paste0("Pipe 1 Forecasted with ARIMA $(0,0,0)$ with mean and $\\lambda$ = ",
         round(lambda,2))))

# residual plot
p1_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle("Residuals for Pipe 1 | ARIMA(0,0,0) with mean")

Pipe 2

The dates were converted for Pipe2 and the data was converted to a tsibble. There are no missing data nor hours.

# reading in excel file
Pipe2 <- read_excel("Waterflow_Pipe2.xlsx", col_types = c('date', 'numeric')) %>%
  # converting into date format
  mutate(`Date Time` = as_datetime(`Date Time`)) %>%
  # renaming column name
  rename(DateTime = `Date Time`) %>%
  # converting to tsibble
  as_tsibble(index = DateTime) 

head(Pipe2)

## # A tsibble: 6 x 2 [1h] <UTC>
##   DateTime            WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-23 01:00:00      18.8
## 2 2015-10-23 02:00:00      43.1
## 3 2015-10-23 03:00:00      38.0
## 4 2015-10-23 04:00:00      36.1
## 5 2015-10-23 05:00:00      31.9
## 6 2015-10-23 06:00:00      28.2

summary(Pipe2)

##     DateTime                     WaterFlow     
##  Min.   :2015-10-23 01:00:00   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:45:00   1st Qu.:28.140  
##  Median :2015-11-12 20:30:00   Median :39.682  
##  Mean   :2015-11-12 20:30:00   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:15:00   3rd Qu.:50.782  
##  Max.   :2015-12-03 16:00:00   Max.   :78.303

The data is stationary as we fail to reject the null hypothesis.

Pipe2 %>%
  gg_tsdisplay(WaterFlow, plot_type='partial') +
  labs(title = "Water Flow of Pipe 2")

Pipe2 %>%
  features(WaterFlow, unitroot_kpss)

## # A tibble: 1 x 3
##   kpss_stat kpss_pvalue .name_repair
##       <dbl>       <dbl> <chr>       
## 1     0.105         0.1 minimal

Looking at the STL decomposition, there seems to be some seasonality on a daily basis, as well as weekly.

Pipe2 %>%
  model(STL(WaterFlow ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition")

The data is transformed using box-cox. The ARIMA model without the box-cox transformation has lower statistics, meaning it is the better model.

# lambda for box cox transformation
lambda <- Pipe2 %>%
  features(WaterFlow, features = guerrero) %>%
  pull(lambda_guerrero)

p2_fit <- Pipe2 %>%
  model(
    # arima model
    ARIMA_bc = ARIMA(box_cox(WaterFlow,lambda)),
    ARIMA = ARIMA(WaterFlow)
  )

glance(p2_fit) %>% arrange(AICc) %>% select(.model:BIC)

## # A tibble: 2 x 6
##   .model   sigma2 log_lik   AIC  AICc   BIC
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_bc   195.  -4054. 8114. 8114. 8128.
## 2 ARIMA      256.  -4191. 8387. 8387. 8402.

p2_fit %>% 
  select(.model = "ARIMA_bc") %>% report()

## Series: WaterFlow 
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean 
## Transformation: box_cox(WaterFlow, lambda) 
## 
## Coefficients:
##         sma1  constant
##       0.0845   34.6140
## s.e.  0.0314    0.4772
## 
## sigma^2 estimated as 194.7:  log likelihood=-4053.87
## AIC=8113.75   AICc=8113.77   BIC=8128.47

With a large h in our forecast, the forecasts eventually become mean of the data. The residuals appear to be white noise.

# forecasting the data
p2_fc <-p2_fit %>%
  forecast(h = 168) %>%
  filter(.model=='ARIMA')

# forecasted plot
p2_fc %>%
  autoplot(Pipe2) +
  ggtitle(TeX(paste0("Pipe 2 Forecasted with ARIMA $(0,0,0)$ with mean and $\\lambda$ = ",
         round(lambda,2))))

# residual plot
p2_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle("Residuals for Pipe 1 | ARIMA(0,0,0) with mean")

Final Forecasts

Here are the final forecasted data using the ARIMA model. Since h was large, the forecasts resemble the mean for the most part. Although it can be forecasted, it might best not to forecast the data as they do not predict future values well.

p1_fc <- p1_fc %>%
  as.data.frame() %>%
  select(DateTime, .mean) %>%
  rename(WaterFlow = .mean)

p2_fc <- p2_fc %>%
  as.data.frame() %>%
  select(DateTime, .mean) %>%
  rename(WaterFlow = .mean)

# export file

write.xlsx(list('Pipe1' = p1_fc, 'Pipe2' = p2_fc), file = 'pipes.xlsx')

DATA 624 Project 1

Orli Khaimova

4/3/2022

Part A | ATM Forecast, ATM624Data.xlsx

Loading the Data

Handling the Missing Data

Exporing the Data

Individual ATM Exploring and Modeling

ATM 1

ATM 2

ATM 3

ATM 4

Forecasted Data

Part B | Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Loading the Data

Handling the Missing Data

Exploring the Data

Modeling and Forecasting

Part C | BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Exploring and Forecasting Data

Pipe 1

Pipe 2

Final Forecasts