In Part A, I want you to forecast how much cash is taken out of four different ATMs for May 2010.
The data is given in a single file.
The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward.
I am being somewhat ambiguous on purpose to give this a little more of a business feel.
Explain and demonstrate your process, techniques used and not used, and your actual forecast.
I am giving you the data via an Excel file. Please provide your written report, with your findings, visuals, discussion, and R code, via an RPubs link, along with the actual .rmd file.
Also, please submit the forecast in an Excel-readable file.
head(atm)
## # A tibble: 6 × 3
## Date ATM Cash
## <dttm> <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1 96
## 2 2009-05-01 00:00:00 ATM2 107
## 3 2009-05-02 00:00:00 ATM1 82
## 4 2009-05-02 00:00:00 ATM2 89
## 5 2009-05-03 00:00:00 ATM1 85
## 6 2009-05-03 00:00:00 ATM2 90
summary(atm)
## Date ATM Cash
## Min. :2009-05-01 00:00:00.0 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 00:00:00.0 Class :character 1st Qu.: 0.5
## Median :2009-11-01 00:00:00.0 Mode :character Median : 73.0
## Mean :2009-10-31 19:33:27.5 Mean : 155.6
## 3rd Qu.:2010-02-01 00:00:00.0 3rd Qu.: 114.0
## Max. :2010-05-14 00:00:00.0 Max. :10919.8
## NA's :19
I had an initial issue with the dates: Excel stores them as an internal serial number, so we had to convert. None of the serial numbers had a decimal component, which indicates that even though the Excel file displayed a timestamp, each value was really just a date that someone had formatted as a timestamp. The min date is 5-1-2009 and the max is 5-14-2010, so a little more than a year of history. For this project we will create a forecast for all of May 2010, covering the 5-1-2010 through 5-14-2010 rows that are present but blank. First, it does look like we have some missing values that we should check out.
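For reference, a minimal sketch of the kind of conversion involved; the serial values below are illustrative, and Excel on Windows counts days from a 1899-12-30 origin.
# Hypothetical example: Excel stores 2009-05-01 as the serial number 39934.
# No fractional part means no time-of-day component, so the serials convert
# straight to dates using Excel's Windows origin.
library(lubridate)
excel_serial <- c(39934, 39935)  # illustrative values
as_date(excel_serial, origin = "1899-12-30")  # "2009-05-01" "2009-05-02"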
Let’s reshape the dataset so that each ATM is its own column.
atm_s<- atm %>% spread(ATM, Cash)
atm_s <- atm_s %>% select(Date, ATM1, ATM2, ATM3, ATM4)
atm_names <- colnames(atm_s)[2:5]
for (i in atm_names) {
print(i)
print(sum(is.na(atm_s[i])))
}
## [1] "ATM1"
## [1] 17
## [1] "ATM2"
## [1] 16
## [1] "ATM3"
## [1] 14
## [1] "ATM4"
## [1] 14
# Most of the NAs are the blank May 2010 rows, so we drop those dates and count again
atm_s <- atm_s %>%
filter(Date < "2010-05-01")
for (i in atm_names) {
print(i)
print(sum(is.na(atm_s[i])))
}
## [1] "ATM1"
## [1] 3
## [1] "ATM2"
## [1] 2
## [1] "ATM3"
## [1] 0
## [1] "ATM4"
## [1] 0
We can see that the remaining missing values are limited to ATM1 and ATM2, and that the rows after 5-1-2010 were simply blank. For the days that are still missing, let’s impute the mean for that ATM; since there are very few missing days, we should be fine.
atm_s <- atm_s %>% mutate(across(c(ATM1, ATM2), ~replace_na(., mean(., na.rm=TRUE))))
paste0("Now that we added in the values and removed the dates we have: ", sum(is.na(atm_s))," NAs")
## [1] "Now that we added in the values and removed the dates we have: 0 NAs"
Now let’s get the summary stats for each ATM.
par(mfrow=c(4,1))
for (i in atm_names) {
print(summary(atm_s[i]))
boxplot(atm_s[i], horizontal = TRUE)
}
## ATM1
## Min. : 1.00
## 1st Qu.: 73.00
## Median : 90.00
## Mean : 83.89
## 3rd Qu.:108.00
## Max. :180.00
## ATM2
## Min. : 0.00
## 1st Qu.: 26.00
## Median : 66.00
## Mean : 62.58
## 3rd Qu.: 93.00
## Max. :147.00
## ATM3
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.7206
## 3rd Qu.: 0.0000
## Max. :96.0000
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
We can see that ATM4 accounts for the bulk of the money withdrawn, ATM1 and ATM2 see a fair amount, and ATM3 sees almost nothing; it has only three days with any withdrawals at all.
Now let’s plot the data. Before we can do that, we first have to convert it to a tsibble.
atm_ts <- atm_s %>%
mutate(Day = as_date(Date)) %>%
as_tsibble(index = Day)
atm_ts <- atm_ts %>% select(Day, ATM1, ATM2, ATM3, ATM4)
atm_ts %>%
gather("ATM","Cash", 2:5) %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "Money Withdrawn USD",
y= "USD in Hundreds",
x = "Day")
From this plot we can see that ATMs 1, 2, and 4 have consistent data and have been in place a while, while ATM3 has only just come online. For that ATM we may just want a (seasonal) naive forecast, since there is essentially no historical data. In the next part, we will look at each ATM individually and create forecasts for them.
But first, to be good data scientists, we need to split the data into train and test sets.
atm_train <- atm_ts %>%
filter_index(~ "2010-03")
atm_test <- atm_ts %>%
filter_index("2010-04")
atm_train %>%
select(ATM1) %>%
autoplot() +
ggtitle("ATM1")
## Plot variable not specified, automatically selected `.vars = ATM1`
# STL decomposition
atm_train %>%
select(ATM1) %>%
model(STL(ATM1 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM1")
The seasonal component shows a clear weekly cycle with a pronounced low point. Let’s find out which day it falls on.
atm_train %>%
add_column(day_no = as.vector(wday(atm_train$Day, week_start = 7))) %>%
as_tibble() %>%
select(ATM1,day_no) %>%
group_by(day_no) %>%
summarise_all(mean)
## # A tibble: 7 × 2
## day_no ATM1
## <dbl> <dbl>
## 1 1 103.
## 2 2 87.0
## 3 3 96.6
## 4 4 80.6
## 5 5 26.5
## 6 6 99.8
## 7 7 96.9
It looks like Thursdays are the low day; my guess is that is when the machine is restocked. We can also see the series holds a roughly constant amplitude, with no obvious indication of an increasing or decreasing trend. With that in mind, let’s look at several models with and without a Box-Cox transformation. We will run whichever ETS the automatic search picks, plus the alternative multiplicative/additive and damped variants, an SNAIVE model, and, on the Box-Cox scale, the same set with the addition of an ARIMA.
lambda_ATM1 <- atm_train %>%
features(ATM1, features = guerrero) %>%
pull(lambda_guerrero)
fit <- atm_train %>%
model(ETS(ATM1))
report(fit)
## Series: ATM1
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.01333705
## gamma = 0.3243183
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 77.98459 -66.76148 3.941321 16.12065 6.093844 14.60934 8.524811 17.47152
##
## sigma^2: 624.7121
##
## AIC AICc BIC
## 4115.103 4115.782 4153.244
atm1_fit <- atm_train %>%
model(
# additive ETS model
additive = ETS(ATM1 ~ error("A") + trend("N") + season("A")),
# damped additive ETS model
additived = ETS(ATM1 ~ error("A") + trend("Ad") + season("A")),
# multiplicative ETS model
multiplicative = ETS(ATM1 ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(ATM1~ lag("week")),
# transformed additive ETS model
additive_l = ETS(box_cox(ATM1,lambda_ATM1) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_l = ETS(box_cox(ATM1,lambda_ATM1) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_l = SNAIVE(box_cox(ATM1,lambda_ATM1)~ lag("week")),
# arima model
ARIMA = ARIMA(box_cox(ATM1,lambda_ATM1))
)
report(atm1_fit)
## # A tibble: 8 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_ro…¹ ma_ro…²
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 additi… 6.25e+2 -2048. 4115. 4116. 4153. 608. 611. 15.7 <NULL> <NULL>
## 2 additi… 6.32e+2 -2048. 4122. 4123. 4171. 609. 611. 15.8 <NULL> <NULL>
## 3 multip… 1.40e-1 -2083. 4186. 4187. 4224. 748. 760. 0.223 <NULL> <NULL>
## 4 snaive 8.37e+2 NA NA NA NA NA NA NA <NULL> <NULL>
## 5 additi… 6.05e+0 -1271. 2562. 2562. 2600. 5.89 5.90 1.42 <NULL> <NULL>
## 6 multip… 5.22e-2 -1302. 2624. 2625. 2662. 6.08 6.10 0.129 <NULL> <NULL>
## 7 snaive… 8.21e+0 NA NA NA NA NA NA NA <NULL> <NULL>
## 8 ARIMA 5.90e+0 -757. 1522. 1522. 1537. NA NA NA <cpl> <cpl>
## # … with abbreviated variable names ¹ar_roots, ²ma_roots
for (model in colnames(atm1_fit)) {
print(atm1_fit %>% select(all_of(model)) %>% gg_tsresiduals() + labs(title = paste0(model," Residuals")))
}
From the report we can see that the ARIMA model has the best AICc score, though note that AICc is only comparable between models fitted on the same scale, so the Box-Cox models cannot be compared directly with the untransformed ones. Looking at the residuals, the additive model without the Box-Cox transformation has the best-behaved distribution. In any case, we can forecast with all of them and see which ones perform best against our test dataset.
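To complement the visual residual checks, we can also run a portmanteau test across the whole mable; a quick sketch, where lag = 14 (two weekly cycles) is my own choice.
# Ljung-Box test on the innovation residuals of every candidate model;
# small p-values indicate leftover autocorrelation.
atm1_fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)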
atm1_fc <- atm1_fit %>%
fabletools::forecast(h = 30)
atm1_fc %>% autoplot(atm_test %>% select(ATM1))
atm1_fc %>% fabletools::accuracy(atm_test %>% select(ATM1)) %>% arrange(RMSE)
## # A tibble: 8 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additived Test -5.81 11.6 9.12 -61.8 65.5 NaN NaN -0.258
## 2 additive Test -5.81 11.6 9.13 -62.1 65.7 NaN NaN -0.257
## 3 ARIMA Test -8.02 12.7 10.5 -47.4 50.1 NaN NaN -0.0936
## 4 multiplicative_l Test -8.45 12.7 10.6 -49.3 51.6 NaN NaN -0.0914
## 5 additive_l Test -8.40 12.8 10.6 -48.0 50.3 NaN NaN -0.124
## 6 multiplicative Test -10.7 15.3 13.1 -71.2 73.8 NaN NaN -0.151
## 7 snaive Test -7.07 16.0 13.5 -66.2 73.4 NaN NaN 0.0437
## 8 snaive_l Test -19.4 25.2 21.6 -116. 118. NaN NaN 0.237
Comparing against the test data, the best model was actually the damped additive ETS without a Box-Cox transformation, so we will use it for the ATM1 predicted values.
ATM1_forcast <- atm1_fit %>%
select(additived) %>%
fabletools::forecast(h = 61)
ATM1_forcast <- ATM1_forcast %>% as_tibble () %>% select(Day, '.mean') %>% rename('ATM1' = '.mean' )
atm_train %>%
select(ATM2) %>%
autoplot() +
ggtitle("ATM2")
## Plot variable not specified, automatically selected `.vars = ATM2`
# STL decomposition
atm_train %>%
select(ATM2) %>%
model(STL(ATM2 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM2")
For ATM2 we see a slight decreasing trend and another weekly seasonal component; let’s see if it follows a similar weekly pattern.
atm_train %>%
add_column(day_no = as.vector(wday(atm_train$Day, week_start = 7))) %>%
as_tibble() %>%
select(ATM2,day_no) %>%
group_by(day_no) %>%
summarise_all(mean)
## # A tibble: 7 × 2
## day_no ATM2
## <dbl> <dbl>
## 1 1 66.9
## 2 2 62.7
## 3 3 79.2
## 4 4 39.3
## 5 5 19.2
## 6 6 94.6
## 7 7 76.0
We see the same weekly pattern as before. Now let’s see which model would be best.
lambda_ATM2 <- atm_train %>%
features(ATM2, features = guerrero) %>%
pull(lambda_guerrero)
fit <- atm_train %>%
model(ETS(ATM2))
report(fit)
## Series: ATM2
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000102
## gamma = 0.3614287
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 71.93059 -45.58854 -33.32942 27.05156 -2.0296 6.550105 6.425435 40.92046
##
## sigma^2: 672.3146
##
## AIC AICc BIC
## 4139.704 4140.383 4177.845
Again the automatically selected ETS is (A,N,A); now let’s see which of the candidate models is best.
atm2_fit <- atm_train %>%
model(
# additive ETS model
additive = ETS(ATM2 ~ error("A") + trend("N") + season("A")),
# damped additive ETS model
additived = ETS(ATM2 ~ error("A") + trend("Ad") + season("A")),
# multiplicative ETS model
multiplicative = ETS(ATM2 ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(ATM2~ lag("week")),
# transformed additive ETS model
additive_l = ETS(box_cox(ATM2,lambda_ATM2) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_l = ETS(box_cox(ATM2,lambda_ATM2) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_l = SNAIVE(box_cox(ATM2,lambda_ATM2)~ lag("week")),
# arima model
ARIMA = ARIMA(box_cox(ATM2,lambda_ATM2))
)
report(atm2_fit) %>% arrange(AICc)
## # A tibble: 8 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_ro…¹ ma_ro…²
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA 65.7 -1151. 2315. 2315. 2338. NA NA NA <cpl> <cpl>
## 2 additi… 70.3 -1682. 3383. 3384. 3422. 68.4 68.6 5.80 <NULL> <NULL>
## 3 multip… 0.266 -1796. 3612. 3613. 3650. 111. 110. 0.393 <NULL> <NULL>
## 4 additi… 672. -2060. 4140. 4140. 4178. 654. 656. 18.2 <NULL> <NULL>
## 5 additi… 681. -2060. 4147. 4148. 4196. 656. 658. 18.2 <NULL> <NULL>
## 6 multip… 0.370 -2162. 4345. 4345. 4383. 1028. 1015. 0.469 <NULL> <NULL>
## 7 snaive 942. NA NA NA NA NA NA NA <NULL> <NULL>
## 8 snaive… 96.3 NA NA NA NA NA NA NA <NULL> <NULL>
## # … with abbreviated variable names ¹ar_roots, ²ma_roots
for (model in colnames(atm2_fit)) {
print(atm2_fit %>% select(all_of(model)) %>% gg_tsresiduals() + labs(title = paste0(model," Residuals")))
}
Again we see the ARIMA model with the lowest AICc, and this time it also has the most normal-looking residuals, suggesting it will be the best model; the test set will tell.
atm2_fc <- atm2_fit %>%
fabletools::forecast(h = 30)
atm2_fc %>% autoplot(atm_test %>% select(ATM2))
atm2_fc %>% fabletools::accuracy(atm_test %>% select(ATM2)) %>% arrange(RMSE)
## # A tibble: 8 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additive Test 9.52 19.4 14.3 -25.6 57.6 NaN NaN -0.0949
## 2 additived Test 9.61 19.4 14.4 -23.7 56.1 NaN NaN -0.0971
## 3 additive_l Test 7.71 19.7 14.8 -89.9 117. NaN NaN -0.0507
## 4 ARIMA Test 7.29 21.2 16.5 -90.6 119. NaN NaN 0.0323
## 5 snaive_l Test 5.01 23.0 15.9 3.61 48.0 NaN NaN -0.289
## 6 snaive Test 12.9 26.2 17.8 35.1 45.1 NaN NaN -0.319
## 7 multiplicative Test -0.285 48.0 41.0 -785. 828. NaN NaN 0.0115
## 8 multiplicative_l Test 3.87 48.0 41.2 -744. 792. NaN NaN 0.0234
Against the test data, however, the plain additive ETS model comes out best for ATM2.
ATM2_forcast <- atm2_fit %>%
select(additive) %>%
fabletools::forecast(h = 61)
ATM2_forcast <- ATM2_forcast %>% as_tibble () %>% select(Day, '.mean') %>% rename('ATM2' = '.mean' )
For ATM3 there is no real history to train on or test against. However, since the other ATMs share the same weekly pattern (ATM4 does too), we can take the seasonality from one of them and apply it to ATM3 via a seasonal naive model. The other option would have been the mean method, but since we know these ATMs follow a weekly seasonal cycle, as a business person I’m sure my boss would appreciate a model that uses that cycle rather than one that just returns the mean.
I decided to borrow from ATM1 because its values on 4/28-4/30 exactly match ATM3’s three observed days, so we can splice ATM1’s April data onto the front of the ATM3 series. I used only April to keep the approach as naive as possible while still accounting for the weekly seasonality.
df33 <- atm_test %>% select(ATM3) %>% filter(ATM3 != 0)
df31 <- atm_test %>% select(ATM1) %>% filter(Day < "2010-04-28") %>% rename("ATM3" = "ATM1")
df_atm3 <- bind_rows(df33,df31)
atm3_fit <- df_atm3 %>%
model(SNAIVE(ATM3 ~ lag("week")))
ATM3_forcast <- atm3_fit %>%
fabletools::forecast(h = 31)
ATM3_forcast <- ATM3_forcast %>% as_tibble () %>% select(Day, '.mean') %>% rename('ATM3' = '.mean' )
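For reference, the mean-method alternative mentioned above would have been a one-liner; a sketch only, since we are not using it.
# Flat mean forecast over the spliced ATM3 series, for comparison.
atm3_mean_fc <- df_atm3 %>%
  model(MEAN(ATM3)) %>%
  fabletools::forecast(h = 31)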
Let’s take a look at ATM4, which has much higher transaction volumes than the other three. The first thing to do is deal with the large spike in the data. It occurred on 2-9-2010, and I could not find any news story or other real-world reason for it, so we will treat it as an anomaly and replace it with the series mean.
avg <- atm_ts %>% as_tibble() %>% select(ATM4) %>% summarise_all(mean)
atm_train$ATM4[atm_train$Day == "2010-02-09"] <- avg$ATM4[1]
atm_train %>%
select(ATM4) %>%
autoplot() +
ggtitle("ATM4")
## Plot variable not specified, automatically selected `.vars = ATM4`
# STL decomposition
atm_train %>%
select(ATM4) %>%
model(STL(ATM4 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM4")
atm_train %>%
add_column(day_no = as.vector(wday(atm_train$Day, week_start = 7))) %>%
as_tibble() %>%
select(ATM4,day_no) %>%
group_by(day_no) %>%
summarise_all(mean)
## # A tibble: 7 × 2
## day_no ATM4
## <dbl> <dbl>
## 1 1 550.
## 2 2 487.
## 3 3 482.
## 4 4 405.
## 5 5 148.
## 6 6 569.
## 7 7 503.
As mentioned above, ATM4 follows the same weekly pattern as the other ATMs, and the automatic ETS again selects (A,N,A), so we will run the same model comparisons. The auto-selected ARIMA’s AICc was not as good as I expected, so I also ran some manual variations to see if a better specification exists.
lambda_ATM4 <- atm_train %>%
features(ATM4, features = guerrero) %>%
pull(lambda_guerrero)
fit <- atm_train %>%
model(ETS(ATM4))
report(fit)
## Series: ATM4
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0002820141
## gamma = 0.0001000387
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 453.818 -289.4372 -49.9708 38.87142 33.06923 94.52715 53.49428 119.4459
##
## sigma^2: 112334.6
##
## AIC AICc BIC
## 5854.405 5855.084 5892.546
fit <- atm_train %>%
model(ARIMA(ATM4))
report(fit)
## Series: ATM4
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 constant
## -0.9005 -0.7159 0.1525 0.9844 0.8373 0.2322 851.2132
## s.e. 0.1106 0.1480 0.0611 0.1002 0.1329 0.0594 52.9251
##
## sigma^2 estimated as 122012: log likelihood=-2433.77
## AIC=4883.54 AICc=4883.98 BIC=4914.05
atm4_fit <- atm_train %>%
model(
# additive ETS model
additive = ETS(ATM4 ~ error("A") + trend("N") + season("A")),
# damped additive ETS model
additived = ETS(ATM4 ~ error("A") + trend("Ad") + season("A")),
# multiplicative ETS model
multiplicative = ETS(ATM4 ~ error("M") + trend("N") + season("M")),
# damped multiplicative ETS model
multiplicatived = ETS(ATM4 ~ error("M") + trend("Ad") + season("M")),
# SNAIVE model
snaive = SNAIVE(ATM4~ lag("week")),
# transformed additive ETS model
additive_l = ETS(box_cox(ATM4,lambda_ATM4) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_l = ETS(box_cox(ATM4,lambda_ATM4) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_l = SNAIVE(box_cox(ATM4,lambda_ATM4)~ lag("week")),
# auto-selected arima model (on the Box-Cox scale)
ARIMA_r = ARIMA(box_cox(ATM4,lambda_ATM4)),
# arima model (3,0,2)(1,1,0)
ARIMA_110 = ARIMA(box_cox(ATM4,lambda_ATM4) ~ pdq(3,0,2) + PDQ(1,1,0)),
# arima model (3,0,1)(1,0,0)
ARIMA_301 = ARIMA(box_cox(ATM4,lambda_ATM4) ~ pdq(3,0,1) + PDQ(1,0,0)),
# arima model (3,0,2)(1,0,0) + constant 0
ARIMA_c0 = ARIMA(box_cox(ATM4,lambda_ATM4) ~ 0 + pdq(3,0,2) + PDQ(1,0,0)),
# arima model (3,0,2)(1,0,0) + constant 1
ARIMA_c1 = ARIMA(box_cox(ATM4,lambda_ATM4) ~ 1 + pdq(3,0,2) + PDQ(1,0,0))
)
report(atm4_fit) %>% arrange(AICc)
## # A tibble: 13 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_ro…¹
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
## 1 ARIMA_r 1.03e+2 -1250. 2509. 2509. 2524. NA NA NA <cpl>
## 2 ARIMA_c1 1.04e+2 -1250. 2516. 2517. 2547. NA NA NA <cpl>
## 3 ARIMA_301 1.06e+2 -1254. 2522. 2522. 2548. NA NA NA <cpl>
## 4 ARIMA_110 1.26e+2 -1257. 2528. 2528. 2554. NA NA NA <cpl>
## 5 ARIMA_c0 1.07e+2 -1259. 2531. 2532. 2558. NA NA NA <cpl>
## 6 additive_l 9.46e+1 -1731. 3483. 3484. 3521. 9.21e1 9.23e1 7.69 <NULL>
## 7 multiplica… 1.95e-1 -1733. 3485. 3486. 3523. 9.61e1 9.64e1 0.345 <NULL>
## 8 additive 1.12e+5 -2917. 5854. 5855. 5893. 1.09e5 1.10e5 264. <NULL>
## 9 multiplica… 7.53e-1 -2922. 5864. 5865. 5902. 1.30e5 1.31e5 0.638 <NULL>
## 10 additived 1.15e+5 -2919. 5864. 5865. 5914. 1.11e5 1.11e5 265. <NULL>
## 11 multiplica… 7.35e-1 -2926. 5878. 5879. 5928. 1.34e5 1.35e5 0.627 <NULL>
## 12 snaive 2.13e+5 NA NA NA NA NA NA NA <NULL>
## 13 snaive_l 1.69e+2 NA NA NA NA NA NA NA <NULL>
## # … with 1 more variable: ma_roots <list>, and abbreviated variable name
## # ¹ar_roots
for (model in colnames(atm4_fit)) {
print(atm4_fit %>% select(all_of(model)) %>% gg_tsresiduals() + labs(title = paste0(model," Residuals")))
}
Looking at the AICc, the ARIMA models all perform similarly, and the residual plots do not clearly separate them either, so we will let the test-set accuracy decide.
atm4_fc <- atm4_fit %>%
fabletools::forecast(h = 30)
atm4_fc %>% autoplot(atm_test %>% select(ATM4), level = NULL)
atm4_fc %>% fabletools::accuracy(atm_test %>% select(ATM4)) %>% arrange(RMSE)
## # A tibble: 13 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 multiplicative Test -54.9 244. 208. -636. 656. NaN NaN 0.117
## 2 multiplicatived Test -53.6 244. 208. -635. 656. NaN NaN 0.122
## 3 multiplicative_l Test -99.7 249. 210. -642. 657. NaN NaN 0.165
## 4 ARIMA_r Test -62.3 259. 223. -828. 849. NaN NaN -0.0138
## 5 ARIMA_c0 Test -67.9 276. 238. -959. 981. NaN NaN 0.000708
## 6 ARIMA_301 Test -67.0 278. 240. -950. 972. NaN NaN -0.00588
## 7 ARIMA_c1 Test -67.8 282. 241. -968. 991. NaN NaN -0.0343
## 8 additive_l Test -54.1 301. 258. -1162. 1195. NaN NaN 0.166
## 9 snaive Test -130. 301. 227. -360. 375. NaN NaN 0.0298
## 10 additive Test -52.4 308. 261. -1226. 1259. NaN NaN 0.119
## 11 additived Test -44.8 309. 262. -1213. 1248. NaN NaN 0.130
## 12 ARIMA_110 Test -226. 360. 280. -712. 718. NaN NaN 0.304
## 13 snaive_l Test -589. 729. 603. -1222. 1223. NaN NaN 0.420
In this case, the multiplicative ETS (M,N,M) model performed the best, so we will use that for the forecast.
ATM4_forcast <- atm4_fit %>%
select(multiplicative) %>%
fabletools::forecast(h = 61)
ATM4_forcast <- ATM4_forcast %>% as_tibble () %>% select(Day, '.mean') %>% rename('ATM4' = '.mean' )
Part B – Forecasting Power
Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
We will start this analysis the same way, by looking at the data.
head(dpower)
## # A tibble: 6 × 3
## Case Date KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
paste0("The min date is: ",min(dpower$Date)," and the max date is: ",max(dpower$Date))
## [1] "The min date is: 1998-Apr and the max date is: 2013-Sep"
p_names <- colnames(dpower)  # column names, used in the loops below
par(mfrow=c(4,1))
for (i in p_names[c(1,3)]) {
print(summary(dpower[i]))
boxplot(dpower[i], horizontal = TRUE, main=i)
}
## Case
## Min. :733.0
## 1st Qu.:780.8
## Median :828.5
## Mean :828.5
## 3rd Qu.:876.2
## Max. :924.0
## KWH
## Min. : 770523
## 1st Qu.: 5429912
## Median : 6283324
## Mean : 6502475
## 3rd Qu.: 7620524
## Max. :10655730
## NA's :1
Case is straightforward, but it does look like KWH has some outliers. Before we proceed, I want to make sure that the Case and Date values are unique, then check all columns for NAs.
sum(duplicated(dpower$Case))
## [1] 0
sum(duplicated(dpower$Date))
## [1] 0
for (i in p_names) {
print(i)
print(sum(is.na(dpower[i])))
}
## [1] "Case"
## [1] 0
## [1] "Date"
## [1] 0
## [1] "KWH"
## [1] 1
dpower %>% filter_all(any_vars(is.na(.)))
## # A tibble: 1 × 3
## Case Date KWH
## <dbl> <chr> <dbl>
## 1 861 2008-Sep NA
Great: we can see there are no duplicates, so we do not need to worry about cleaning those columns. But there is one missing value in the data (September 2008); we will replace it with the mean, and then we will create a Box-Cox transformed column for the KWH data.
Note: I later realized some months were missing from the data entirely, so we will be adding those in here as well (via fill_gaps()).
dpower <- dpower %>% mutate(across(KWH, ~replace_na(., mean(., na.rm=TRUE))))
power_ts <- dpower %>%
mutate(Date = yearmonth(Date)) %>%
as_tsibble(index = Date) %>%
fill_gaps(.full = TRUE)
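The Box-Cox column mentioned above is not shown in the original chunk; here is a minimal sketch, with lambda chosen by the Guerrero method (the column name KWH_bc is my own, and this assumes the gap months added by fill_gaps() get imputed first).
# Guerrero-selected lambda, then a transformed copy of KWH for reference.
lambda_KWH <- power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)
power_ts <- power_ts %>%
  mutate(KWH_bc = box_cox(KWH, lambda_KWH))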
Now let’s plot and see what we are dealing with, using an STL decomposition so we can see the series as a whole and each of its components:
# STL decomposition
power_ts %>%
select(KWH) %>%
model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for KWH")
We can see strong seasonality, but also a large one-off drop in KWH somewhere in January. Since it is a one-time event, I don’t think we have to remove it: it should not weigh on the fit too much, and it does not affect the seasonality. As with the ATMs, we will first split the data, then evaluate the candidate models against a held-out test set.
This time, however, we will take the winning model and refit it on the whole dataset before predicting, since we now have only 12 seasonal periods per year instead of 52.
power_train <- power_ts %>% filter(Case < 900 )
min(power_train$Date)
## <yearmonth[1]>
## [1] "1998 Jan"
max(power_train$Date)
## <yearmonth[1]>
## [1] "2011 Nov"
power_test <- power_ts %>% filter(Case >= 900)
min(power_test$Date)
## <yearmonth[1]>
## [1] "2011 Dec"
max(power_test$Date)
## <yearmonth[1]>
## [1] "2013 Dec"
fit <- power_train %>%
model(ETS(KWH))
report(fit)
## Series: KWH
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1321936
## gamma = 0.0001000993
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 6129953 0.917436 0.7497227 0.8826468 1.19224 1.264716 1.208195 0.9891802
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.7679627 0.815219 0.9221161 1.077275 1.21329
##
## sigma^2: 0.0144
##
## AIC AICc BIC
## 5385.722 5388.901 5432.492
fit <- power_train %>%
model(ARIMA(KWH))
report(fit)
## Series: KWH
## Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.1352 -0.9763 -0.5591 178547.98
## s.e. 0.0811 0.0801 0.0894 79349.56
##
## sigma^2 estimated as 6.774e+11: log likelihood=-2336.62
## AIC=4683.24 AICc=4683.64 BIC=4698.45
We can see the best ETS fit is an (M,N,M) model; we will run that plus its additive (A,N,A) counterpart. For ARIMA, we will run the auto-selected model, which came out as (0,0,1)(2,1,0)[12] with drift, along with several manual (1,0,1) variants with different seasonal terms and constants, to see which one is best.
power_fit <- power_train %>%
model(
# additive ETS model
additive = ETS(KWH ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
multiplicative = ETS(KWH ~ error("M") + trend("N") + season("M")),
# auto-selected arima model
ARIMA_r = ARIMA(KWH),
# arima model (1,0,1)(0,2,1)
ARIMA_021 = ARIMA(KWH ~ pdq(1,0,1) + PDQ(0,2,1)),
# arima model (1,0,1)(0,1,2)
ARIMA_012 = ARIMA(KWH ~ pdq(1,0,1) + PDQ(0,1,2)),
# arima model (1,0,1)(0,1,1) + constant 0
ARIMA_c0 = ARIMA(KWH ~ 0 + pdq(1,0,1) + PDQ(0,1,1)),
# arima model (1,0,1)(0,1,1) + constant 1
ARIMA_c1 = ARIMA(KWH ~ 1 + pdq(1,0,1) + PDQ(0,1,1))
)
report(power_fit) %>% arrange(AICc)
## # A tibble: 7 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_ro…¹
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
## 1 ARIMA_0… 1.49e+12 -2221. 4450. 4450. 4462. NA NA NA <cpl>
## 2 ARIMA_r 6.77e+11 -2337. 4683. 4684. 4698. NA NA NA <cpl>
## 3 ARIMA_0… 7.12e+11 -2339. 4689. 4690. 4708. NA NA NA <cpl>
## 4 ARIMA_c1 7.39e+11 -2343. 4696. 4697. 4711. NA NA NA <cpl>
## 5 ARIMA_c0 7.66e+11 -2345. 4699. 4699. 4711. NA NA NA <cpl>
## 6 multipl… 1.44e- 2 -2678. 5386. 5389. 5432. 6.97e11 7.40e11 7.64e-2 <NULL>
## 7 additive 7.41e+11 -2702. 5434. 5438. 5481. 6.79e11 7.00e11 5.09e+5 <NULL>
## # … with 1 more variable: ma_roots <list>, and abbreviated variable name
## # ¹ar_roots
for (model in colnames(power_fit)) {
print(power_fit %>% select(all_of(model)) %>% gg_tsresiduals() + labs(title = paste0(model," Residuals")))
}
power_fc <- power_fit %>%
fabletools::forecast(h = 24)
power_fc %>% autoplot(power_test %>% select(KWH), level = NULL)
power_fc %>% fabletools::accuracy(power_test %>% select(KWH)) %>% arrange(RMSE)
## # A tibble: 7 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 multiplicative Test 43856. 670564. 4.95e5 -0.0621 6.66 NaN NaN 0.0738
## 2 additive Test 367469. 801350. 6.04e5 3.93 7.74 NaN NaN 0.177
## 3 ARIMA_c1 Test 459176. 806384. 5.92e5 5.41 7.48 NaN NaN 0.166
## 4 ARIMA_c0 Test 747267. 997737. 7.67e5 9.53 9.81 NaN NaN 0.165
## 5 ARIMA_021 Test -347092. 1006604. 7.69e5 -4.54 9.64 NaN NaN 0.0350
## 6 ARIMA_012 Test 530606. 1063905. 7.01e5 6.28 8.78 NaN NaN 0.250
## 7 ARIMA_r Test 490946. 1253331. 8.02e5 5.57 10.1 NaN NaN 0.190
We can see that the multiplicative ETS model is the best, so we will refit it on the full series and output the forecast.
power_fit_m <- power_ts %>%
model(# multiplicative ETS model
ETS(KWH ~ error("M") + trend("N") + season("M")))
report(power_fit_m)
## Series: KWH
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1113039
## gamma = 0.0001001479
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 6134546 0.9570497 0.7499494 0.8695777 1.188998 1.260456 1.197843 0.9963632
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.7655965 0.8115011 0.9123226 1.068165 1.222177
##
## sigma^2: 0.0144
##
## AIC AICc BIC
## 6223.827 6226.555 6272.690
power_m_fc <- power_fit_m %>%
fabletools::forecast(h = 12)
power_m_fc %>% autoplot(power_ts %>% select(KWH))
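Finally, the assignment asks for the forecasts in an Excel-readable file. A minimal sketch of that last step, assuming the writexl package; the file name forecasts.xlsx is my own choice.
# Combine the four ATM forecasts, keep only May 2010, and write everything
# to one workbook alongside the 2014 power forecast.
library(writexl)
atm_out <- ATM1_forcast %>%
  left_join(ATM2_forcast, by = "Day") %>%
  left_join(ATM3_forcast, by = "Day") %>%
  left_join(ATM4_forcast, by = "Day") %>%
  filter(Day >= as.Date("2010-05-01"))
power_out <- power_m_fc %>%
  as_tibble() %>%
  mutate(Date = as.Date(Date)) %>%  # yearmonth -> Date for Excel
  select(Date, KWH = .mean)
write_xlsx(list(ATM = atm_out, Power = power_out), "forecasts.xlsx")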