My attempt at the first project
First, we will import the relevant libraries and read the Excel file.
library(tidyverse)
library(readxl)
library(fpp3)
df = read_excel("C:\\Users\\Al Haque\\Downloads\\ATM624Data.xlsx")
head(df)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
The dates were logged as Excel serial numbers, so we have to convert them into a proper Date format by specifying Excel’s origin (1899-12-30 on Windows).
df$DATE <- as.Date(df$DATE, origin = "1899-12-30")
df
## # A tibble: 1,474 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
## 7 2009-05-04 ATM1 90
## 8 2009-05-04 ATM2 55
## 9 2009-05-05 ATM1 99
## 10 2009-05-05 ATM2 79
## # ℹ 1,464 more rows
Our dataset looks much better now.
Let’s explore our data and see what insights we can gather.
head(df)
## # A tibble: 6 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
There are columns for the date, the ATM identifier, and the cash amount. We have four ATMs with varying cash amounts, so presumably we will have to forecast values for each ATM.
Let’s use the skimr library to skim our dataset.
library(skimr)
skim(df)
| Name | df |
|---|---|
| Number of rows | 1474 |
| Number of columns | 3 |
| Column type frequency: | |
| character | 1 |
| Date | 1 |
| numeric | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ATM | 14 | 0.99 | 4 | 4 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| DATE | 0 | 1 | 2009-05-01 | 2010-05-14 | 2009-11-01 | 379 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Cash | 19 | 0.99 | 155.58 | 376.46 | 0 | 0.5 | 73 | 114 | 10919.76 | ▇▁▁▁▁ |
Looking at the dataset, we have 1,474 observations, with 14 missing values under ATM and 19 missing values under Cash. Interestingly, while the mean of the Cash column is about 155, the maximum is 10,919.76, which can skew our data.
Let’s try to locate our missing values in our dataset so we can better understand what exactly is going on.
df[is.na(df$Cash), ]
## # A tibble: 19 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-18 ATM2 NA
## 4 2009-06-22 ATM1 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
It appears the rows with missing ATM labels are the May 2010 values we will have to predict for the assignment, so we can split the dataset before May 1st, 2010. That leaves only five missing Cash values to impute, which we can do with the mean.
Let’s next visualize the data.
ggplot(df,aes(x = DATE, y = Cash)) +
geom_line() + facet_wrap(~ATM,scales = "free",nrow = 4)
## Warning: Removed 14 rows containing missing values (`geom_line()`).
So the outlier appears in ATM4, and ATM3 seems to have mostly zero values until it increases at the end; we also have to deal with our NA values.
Now that we have explored our data, we have to clean it. First, we convert it to a tsibble keyed by ATM and filter to the observed period (through April 30th, 2010).
df_ts <- df %>%
as_tsibble(index = DATE,key = ATM) %>%
filter_index("2009-05-01" ~ "2010-04-30")
df_ts
## # A tsibble: 1,460 x 3 [1D]
## # Key: ATM [4]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
## 7 2009-05-07 ATM1 8
## 8 2009-05-08 ATM1 104
## 9 2009-05-09 ATM1 87
## 10 2009-05-10 ATM1 93
## # ℹ 1,450 more rows
Next, we will impute the missing values with the mean. Checking them, we can impute each missing observation with the average Cash of its respective ATM.
df_ts[is.na(df_ts$Cash), ]
## # A tsibble: 5 x 3 [1D]
## # Key: ATM [2]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
df %>%
group_by(ATM) %>%
filter(!is.na(Cash)) %>%
summarise(mean(Cash))
## # A tibble: 4 × 2
## ATM `mean(Cash)`
## <chr> <dbl>
## 1 ATM1 83.9
## 2 ATM2 62.6
## 3 ATM3 0.721
## 4 ATM4 474.
For ATM1 we can impute those missing values with 83, and for ATM2 with 63 (rounding the per-ATM means of 83.9 and 62.6).
First, imputing individually for ATM1:
## replace this observation with 83
df_ts[44,3] <- 83
df_ts[47,3] <- 83
df_ts[53,3] <- 83
Now for ATM2
df_ts[414,3] <- 63
df_ts[420,3] <- 63
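For reference, here is a sketch of how the same imputation could be done programmatically, without hard-coded row positions; it assumes the df_ts tsibble from above and would be run instead of the manual edits.
## A sketch (alternative to the manual edits above): impute each ATM's
## missing Cash with that ATM's own mean
df_ts <- df_ts %>%
  group_by(ATM) %>%
  mutate(Cash = ifelse(is.na(Cash), mean(Cash, na.rm = TRUE), Cash)) %>%
  ungroup()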
Let’s quickly check: it looks good, no missing values remain.
df_ts %>%
filter(is.na(Cash))
## # A tsibble: 0 x 3 [?]
## # Key: ATM [0]
## # ℹ 3 variables: DATE <date>, ATM <chr>, Cash <dbl>
Now we will replace the outlier in ATM4 with the median value.
df_ts[which.max(df_ts$Cash),]
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
df %>%
filter(!is.na(Cash)) %>%
group_by(ATM) %>%
summarise(median(Cash))
## # A tibble: 4 × 2
## ATM `median(Cash)`
## <chr> <dbl>
## 1 ATM1 91
## 2 ATM2 67
## 3 ATM3 0
## 4 ATM4 404.
We will replace the outlier value of 10,919.76 with the ATM4 median, 403.84.
df_ts[1380,3] <- 403.84
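Equivalently, the outlier could be located and replaced programmatically, building on the which.max() lookup above rather than a hard-coded row number; a small sketch, assuming df_ts as above:
## A sketch: replace the maximum Cash observation with the median of its own ATM
out_row <- which.max(df_ts$Cash)
out_atm <- df_ts$ATM[out_row]
df_ts$Cash[out_row] <- median(df_ts$Cash[df_ts$ATM == out_atm], na.rm = TRUE)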
Now we have cleaned our data the way we want it. Before we begin modeling, let’s quickly visualize the data.
df_ts %>%
autoplot(Cash) + facet_wrap(~ATM,scales = "free",nrow = 4)
The visualization looks good to me and the apparent outlier is gone from ATM4.
Our next step will be to build models for each ATM. We will begin with ATM1.
atm1 <- df_ts %>%
filter(ATM == "ATM1")
atm1 %>%
autoplot() + labs(title = "ATM1 Cash Flow")
## Plot variable not specified, automatically selected `.vars = Cash`
The data has sharp peaks and drops, which could indicate seasonality, and the variance appears to be very high for this particular ATM. I will attempt a Box-Cox transformation to see whether the variance can be stabilized, and then use the transformed data to build our models.
Atm1lambda <- atm1 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
## Use the plot-title code from the textbook to display the lambda value
atm1 %>%
autoplot(box_cox(Cash,Atm1lambda)) + labs(title = latex2exp::TeX(paste0(
"Transformed cash values with $\\lambda$ = ",
round(Atm1lambda,2))))
We have a lambda of 0.23, which is close to 0 and so suggests something near a log transformation. The cash values still show a prominent peak and drop, but we can now use this transformed data for our models.
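For reference, the classic Box-Cox transformation applied by box_cox() is sketched below (fabletools’ version also handles negative values, which this simple form does not):
## Classic Box-Cox: log(y) when lambda = 0, else (y^lambda - 1) / lambda
box_cox_manual <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
## With lambda = 0.23 the result behaves much like a log transform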
Let’s try to build some models. Looking at the plot, the data has some sort of seasonality but no apparent trend.
atm1_fc <- atm1 %>%
model(`seasonal_naive` = SNAIVE(box_cox(Cash,Atm1lambda)),
additive = ETS(box_cox(Cash,Atm1lambda) ~ error("A") + trend("N") + season("A")),
multiplicative = ETS(box_cox(Cash,Atm1lambda) ~ error("M") + trend("N") + season("M")),
stepwise = ARIMA(box_cox(Cash,Atm1lambda)))
glance(atm1_fc) %>%
arrange(AICc) %>%
select(.model:BIC)
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stepwise 1.49 -579. 1167. 1167. 1182.
## 2 additive 1.52 -1149. 2317. 2318. 2356.
## 3 multiplicative 0.0367 -1186. 2393. 2393. 2432.
## 4 seasonal_naive 2.10 NA NA NA NA
accuracy(atm1_fc) %>%
arrange(RMSE)
## # A tibble: 4 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 multiplicative Train… 3.94 23.5 15.5 -29.4 49.4 0.872 0.844 0.117
## 2 ATM1 additive Train… 2.51 25.1 16.2 -91.6 111. 0.908 0.900 0.115
## 3 ATM1 stepwise Train… 2.40 25.1 16.2 -87.8 107. 0.907 0.900 0.0105
## 4 ATM1 seasonal_naive Train… -0.0363 27.9 17.8 -96.6 116. 1 1 0.150
Looking at the diagnostics from the glance() and accuracy() functions, the ARIMA model has the lowest AICc but the multiplicative ETS model has the lowest RMSE. In this instance I will choose the ARIMA model for its lowest AICc, since AICc penalizes extra parameters and so guards against overfitting (keeping in mind that AICc is only strictly comparable between models of the same class).
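Because AICc comparisons across model classes are shaky, time-series cross-validation is a more defensible way to choose between the ARIMA and ETS candidates. A sketch, assuming atm1 and Atm1lambda from above; the .init and .step values are arbitrary choices, and refitting on every window is slow:
## Time-series cross-validation: refit on growing windows, forecast 7 days
## ahead, and compare out-of-sample accuracy
atm1 %>%
  stretch_tsibble(.init = 300, .step = 7) %>%
  model(stepwise = ARIMA(box_cox(Cash, Atm1lambda)),
        multiplicative = ETS(box_cox(Cash, Atm1lambda) ~ error("M") + trend("N") + season("M"))) %>%
  forecast(h = 7) %>%
  accuracy(atm1)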
Let’s take a look at the ARIMA model
atm1_fc %>% select(.model = "stepwise") %>% report()
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, Atm1lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.0946 -0.1158 -0.6475
## s.e. 0.0525 0.0530 0.0424
##
## sigma^2 estimated as 1.487: log likelihood=-579.38
## AIC=1166.76 AICc=1166.87 BIC=1182.28
For the first ATM, the automatic search selected an ARIMA(0,0,2)(0,1,1)[7] model.
Let’s forecast the values with the stepwise model; we will forecast the missing 14 days.
atm1_fcc <- atm1_fc %>%
forecast(h = "14 days")%>%
filter(.model == "stepwise")
## Let's plot the atm forecast
atm1_fcc %>%
autoplot(atm1) + labs(title = "Forecasts with ARIMA model")
## Let's check the residual diagnostics for the ARIMA
atm1_fc %>%
select(stepwise) %>%
gg_tsresiduals()
The residuals appear consistent with white noise, since there are no significant spikes in the ACF; however, they do not look normally distributed, showing a left-tail skew.
# Ljung-Box test: lag = 2*m = 14 since m = 7 for daily data with weekly
# seasonality; dof = 4 here for the estimated model coefficients
atm1_fc %>%
select(.model = "stepwise") %>%
augment() %>%
features(.innov, ljung_box, lag = 14, dof = 4)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 .model 11.0 0.354
The p-value is well above 0.05, so we fail to reject the null hypothesis and can treat the residuals as white noise.
Let’s focus on ATM2 now
atm2 <- df_ts %>%
filter(ATM == "ATM2")
atm2 %>%
autoplot(Cash) + labs(title = "ATM2 Cash Flow")
ATM2 looks pretty similar to ATM1, with sharp peaks and falls; we can also see seasonality in the data, and there is no trend for this ATM either.
As we did with ATM1, we will transform the series using a Box-Cox transformation.
Atm2lambda <- atm2 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
## Use the plot-title code from the textbook to display the lambda value
atm2 %>%
autoplot(box_cox(Cash,Atm2lambda)) + labs(title = latex2exp::TeX(paste0(
"Transformed cash values with $\\lambda$ = ",
round(Atm2lambda,2))))
This time the lambda value is about 0.72, which is closer to a square-root transformation (lambda = 0.5) than a log. We can see that the variance is much more evenly spread after transforming.
Let’s build some models for ATM2 now. Looking at the plot, the data has some seasonality but no apparent trend, so we will simply use the same types of models as we did for ATM1.
## Build the same set of models for ATM2, unchanged from ATM1
atm2_fc <- atm2 %>%
model(`seasonal_naive` = SNAIVE(box_cox(Cash,Atm2lambda)),
additive = ETS(box_cox(Cash,Atm2lambda) ~ error("A") + trend("N") + season("A")),
multiplicative = ETS(box_cox(Cash,Atm2lambda) ~ error("M") + trend("N") + season("M")),
stepwise = ARIMA(box_cox(Cash,Atm2lambda)))
glance(atm2_fc) %>%
arrange(AICc) %>%
select(.model:BIC)
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stepwise 67.8 -1262. 2537. 2537. 2560.
## 2 additive 71.9 -1852. 3725. 3726. 3764.
## 3 multiplicative 0.308 -2012. 4044. 4044. 4083.
## 4 seasonal_naive 99.7 NA NA NA NA
accuracy(atm2_fc) %>%
arrange(RMSE)
## # A tibble: 4 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 stepwise Trai… 0.254 24.4 17.1 -Inf Inf 0.824 0.812 -0.0129
## 2 ATM2 additive Trai… 0.506 25.3 17.7 -Inf Inf 0.853 0.841 0.0161
## 3 ATM2 seasonal_naive Trai… 0.0223 30.1 20.8 -Inf Inf 1 1 0.0458
## 4 ATM2 multiplicative Trai… 5.16 34.0 27.0 -Inf Inf 1.30 1.13 0.00933
This is interesting: for ATM2 the stepwise ARIMA model has both the lowest RMSE and the lowest AICc, so we will use it to forecast our values.
Let’s take a look at the ARIMA model
atm2_fc %>% select(.model = "stepwise") %>% report()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, Atm2lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4245 -0.9068 0.4682 0.7994 -0.7252
## s.e. 0.0587 0.0440 0.0889 0.0585 0.0409
##
## sigma^2 estimated as 67.82: log likelihood=-1262.41
## AIC=2536.83 AICc=2537.07 BIC=2560.11
For the second ATM, the automatic search selected an ARIMA(2,0,2)(0,1,1)[7] model.
Let’s forecast the values with the stepwise model; we will forecast the missing 14 days as usual.
atm2_fcc <- atm2_fc %>%
forecast(h = "14 days")%>%
filter(.model == "stepwise")
## Let's plot the atm forecast
atm2_fcc %>%
autoplot(atm2) + labs(title = "Forecasts with ARIMA model for Atm 2")
## Let's check the residual diagnostics for the ARIMA on atm2
atm2_fc %>%
select(stepwise) %>%
gg_tsresiduals()
The residuals appear consistent with white noise, since there is only one significant lag, and they look roughly normally distributed.
# Ljung-Box test: lag = 2*m = 14 since m = 7 for daily data with weekly
# seasonality; dof = 2 here
atm2_fc %>%
select(.model = "stepwise") %>%
augment() %>%
features(.innov, ljung_box, lag = 14, dof = 2)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 .model 10.5 0.574
The p-value is not significant, so we can treat the residuals as white noise.
For ATM3, since the majority of values are zeros with only a few non-zero values at the end, I decided to just build an SES (simple exponential smoothing) model along with some benchmark methods.
atm3 <- df_ts %>%
filter(ATM == "ATM3")
atm3 %>%
autoplot(Cash)
This data is not ideal for modeling, but we can fit some simple models and produce forecasts.
atm3_fc <- atm3 %>%
model(
NAIVE(Cash),
SNAIVE(Cash),
MEAN(Cash),
ets = ETS(Cash ~ error("A") + trend("N") + season("N"))
)
report(atm3_fc)
## Warning in report.mdl_df(atm3_fc): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 4 × 10
## ATM .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM3 NAIVE(Cash) 25.9 NA NA NA NA NA NA NA
## 2 ATM3 SNAIVE(Cash) 64.3 NA NA NA NA NA NA NA
## 3 ATM3 MEAN(Cash) 63.1 NA NA NA NA NA NA NA
## 4 ATM3 ets 25.4 -1666. 3338. 3338. 3350. 25.3 44.3 0.273
atm3_pre <- atm3_fc %>%
forecast(h = "14 days") %>%
filter(.model == "ets")
accuracy(atm3_fc) %>%
arrange(RMSE)
## # A tibble: 4 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM3 ets Trai… 2.70e- 1 5.03 0.273 Inf Inf 0.371 0.625 -0.00736
## 2 ATM3 NAIVE(Cas… Trai… 2.34e- 1 5.09 0.310 28.8 40.2 0.423 0.632 -0.149
## 3 ATM3 MEAN(Cash) Trai… -5.63e-17 7.93 1.43 -Inf Inf 1.95 0.986 0.640
## 4 ATM3 SNAIVE(Ca… Trai… 7.35e- 1 8.04 0.735 100 100 1 1 0.640
The only way to compare these is by RMSE, and it seems ETS has a lower RMSE than the other benchmark methods, which is not surprising. This is as far as I will go for ATM3 since the data is so sparse, so we will use the ETS model.
atm3_pre %>%
autoplot(atm3) + labs(title = "Forecast values for atm3")
atm3_fc %>%
select(.model = ets) %>%
gg_tsresiduals()
The residual diagnostics are not very informative for the third ATM, given how sparse the series is.
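One alternative worth noting for an intermittent series like ATM3 is Croston’s method, which fable provides as CROSTON(); a quick sketch, assuming the atm3 tsibble from above:
## Croston's method smooths the non-zero demand sizes and the intervals
## between demands separately, which suits sparse series like ATM3
atm3 %>%
  model(croston = CROSTON(Cash)) %>%
  forecast(h = "14 days")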
As always, we follow the same procedure for ATM4.
atm4 <- df_ts %>%
filter(ATM == "ATM4")
atm4 %>%
autoplot(Cash) + labs(title = "ATM4 Cash Values")
It looks like we have sharp seasonality and no apparent trend, just like ATM1 and ATM2.
We will transform the data again using Box-Cox and then use the transformed series for our models.
Atm4lambda <- atm4 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
## Use the plot-title code from the textbook to display the lambda value
atm4 %>%
autoplot(box_cox(Cash,Atm4lambda)) + labs(title = latex2exp::TeX(paste0(
"Transformed cash values with $\\lambda$ = ",
round(Atm4lambda,2))))
This time the lambda value is 0.45, and as before the variance looks much more evenly spread after transforming.
Let’s build some models for ATM4 now. Looking at the plot, the data has some seasonality but no apparent trend, so we will use the same types of models as we did for ATM1 and ATM2.
## Build the same set of models for ATM4, unchanged from before
atm4_fc <- atm4 %>%
model(`seasonal_naive` = SNAIVE(box_cox(Cash,Atm4lambda)),
additive = ETS(box_cox(Cash,Atm4lambda) ~ error("A") + trend("N") + season("A")),
multiplicative = ETS(box_cox(Cash,Atm4lambda) ~ error("M") + trend("N") + season("M")),
stepwise = ARIMA(box_cox(Cash,Atm4lambda)))
glance(atm4_fc) %>%
arrange(AICc) %>%
select(.model:BIC)
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stepwise 182. -1466. 2943. 2943. 2962.
## 2 multiplicative 0.225 -2009. 4038. 4039. 4077.
## 3 additive 173. -2012. 4044. 4045. 4083.
## 4 seasonal_naive 298. NA NA NA NA
accuracy(atm4_fc) %>%
arrange(RMSE)
## # A tibble: 4 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 additive Training 67.2 338. 260. -377. 421. 0.750 0.747 0.0974
## 2 ATM4 multiplicative Training 65.2 346. 264. -364. 408. 0.763 0.764 0.0921
## 3 ATM4 stepwise Training 84.4 352. 274. -371. 415. 0.792 0.778 0.0200
## 4 ATM4 seasonal_naive Training -3.70 452. 346. -392. 447. 1 1 0.0602
For ATM4, the stepwise ARIMA model has the lowest AICc, not surprisingly, but the lowest RMSE in this case belongs to the additive ETS model. As before, I will use the ARIMA model since it has the lowest AICc.
Let’s take a look at the ARIMA model as always
atm4_fc %>% select(.model = "stepwise") %>% report()
## Series: Cash
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, Atm4lambda)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.0769 0.2088 0.1996 17.1613
## s.e. 0.0528 0.0516 0.0525 0.7422
##
## sigma^2 estimated as 182.2: log likelihood=-1466.35
## AIC=2942.7 AICc=2942.87 BIC=2962.2
For the fourth ATM, the automatic search selected an ARIMA(0,0,1)(2,0,0)[7] model with a mean.
Let’s forecast the values with the stepwise model; we will forecast the missing 14 days as usual.
atm4_fcc <- atm4_fc %>%
forecast(h = "14 days")%>%
filter(.model == "stepwise")
## Let's plot the atm forecast
atm4_fcc %>%
autoplot(atm4) + labs(title = "Forecasts with ARIMA model for Atm 4")
## Let's check the residual diagnostics for the ARIMA on atm4
atm4_fc %>%
select(stepwise) %>%
gg_tsresiduals()
The residuals appear consistent with white noise, since there is only one significant lag, and they look roughly normally distributed.
# Ljung-Box test: lag = 2*m = 14 since m = 7 for daily data; dof = 4 for the
# four estimated coefficients (ma1, sar1, sar2, constant)
atm4_fc %>%
select(.model = "stepwise") %>%
augment() %>%
features(.innov, ljung_box, lag = 14, dof = 4)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 .model 15.9 0.103
The p-value is not significant, so we can treat the residuals as white noise.
saveatm1 <- atm1_fcc %>%
as.data.frame() %>%
select(ATM,DATE,.mean)
saveatm2 <- atm2_fcc %>%
as.data.frame() %>%
select(ATM,DATE,.mean)
saveatm3 <- atm3_pre %>%
as.data.frame() %>%
select(ATM,DATE,.mean)
saveatm4 <- atm4_fcc %>%
as.data.frame() %>%
select(ATM,DATE,.mean)
## The original file had three columns with the ATMs stacked row-wise, so we bind the forecasts the same way
final_val <- bind_rows(saveatm1,saveatm2,saveatm3,saveatm4)
I was initially unsure whether the predictions needed to be back-transformed since I used a Box-Cox transformation in the models; however, because the transformation is specified inside the model formula, fable back-transforms the forecasts automatically, so the .mean values are already on the original cash scale.
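If a manual back-transformation were ever needed (for example, for values produced outside of fable’s forecast pipeline), the inverse of the classic Box-Cox is short; a sketch:
## Inverse of the classic Box-Cox: exp(w) when lambda = 0,
## else (lambda * w + 1)^(1 / lambda)
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}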
library(writexl)
## Warning: package 'writexl' was built under R version 4.3.3
## Save it to an Excel file
write_xlsx(final_val,'atm_predictions.xlsx')
For the ATM data we had four different ATMs, each with its own series. ATM1 and ATM2 looked fairly well behaved but had a few missing values. ATM3 had mostly zero values with a few observations at the end; I was loath to discard this data since it may still carry some information. ATM4 had an outlier, which I replaced with the ATM4 median since the mean was skewed by the outlier, and I imputed the few missing values in ATM1 and ATM2 with their respective ATM means, which left the data clean. As for model building, the ARIMA model did very well for ATM1, ATM2, and ATM4: it had the lowest AICc compared to the ETS models and the benchmark methods. Interestingly, when comparing the models by RMSE, the multiplicative ETS model was lowest for ATM1, the stepwise ARIMA for ATM2, and the additive ETS for ATM4. Regardless, the ARIMA models performed best on the AICc metric, and preferring a simpler model over a complex one is sensible from a business perspective. ATM3 was difficult to compare since the data was mostly sparse, but it was not surprising that simple exponential smoothing outperformed the other benchmark methods (seasonal naive, naive, and mean). Thus, for ATM1, ATM2, and ATM4 I chose the ARIMA model with the Box-Cox transformation, while for ATM3 I chose simple exponential smoothing.
Here is Part II of the first project.
We will read the data for the power forecasting task and follow the same workflow: tidy, visualize, model, and forecast. The goal is to produce a monthly forecast of residential power usage for 2014.
df2 = read_excel("C:\\Users\\Al Haque\\Downloads\\ResidentialCustomerForecastLoad-624.xlsx")
df2
## # A tibble: 192 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
## 7 739 1998-Jul 8914755
## 8 740 1998-Aug 8607428
## 9 741 1998-Sep 6989888
## 10 742 1998-Oct 6345620
## # ℹ 182 more rows
Reading the data file, it looks pretty good so far.
Let’s explore the power data and visualize it to see if anything needs cleaning.
summary(df2)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
CaseSequence appears to be a case number, so we should focus on the KWH column of monthly usage. We need to convert the YYYY-MMM column into a proper date column, and we need to impute the single NA value.
Let’s take a look at the distribution of KWH.
hist(df2$KWH)
It seems the distribution of KWH is concentrated in the mid-to-upper range of power usage.
df2 %>%
filter(is.na(KWH))
## # A tibble: 1 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 861 2008-Sep NA
It seems the missing KWH value falls in September 2008, at CaseSequence 861.
Let’s clean the data now: rename the date column and parse it as a yearmonth.
## My previous attempt at cleaning the time column didn't work; chapter 2.1 of the textbook shows how to create proper tsibble objects
df2 <- df2 %>%
rename(Time = `YYYY-MMM`) %>%
mutate(Time = yearmonth(Time))
Here we have properly fixed the date column, renaming it to Time and parsing it as a yearmonth.
Now we can properly visualize the KWH series as a line graph.
ggplot(df2,aes(x = Time,y = KWH)) +
geom_line() + labs(title = "KWH Monthly Usage")
Taking a look at the plot, we can see a slight positive trend and strong monthly seasonality, though we can also notice an outlier at around 2010.
Let’s impute the missing NA value with the median now.
df2 %>%
filter((is.na(KWH)))
## # A tibble: 1 × 3
## CaseSequence Time KWH
## <dbl> <mth> <dbl>
## 1 861 2008 Sep NA
The missing value is located at September 2008; let’s find the median now.
summary(df2$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
The median value is 6,283,324, so we can impute the missing value with it. We can do this manually since it is a single observation.
df2[129,3] <- 6283324
summary(df2$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6283324 6501333 7608792 10655730
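As with the ATM data, the same imputation can be written without a hard-coded row index; a sketch, assuming df2 from above:
## Replace any missing KWH with the median of the observed values
df2 <- df2 %>%
  mutate(KWH = if_else(is.na(KWH), median(KWH, na.rm = TRUE), KWH))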
Let’s now identify the outlier observation in our data frame: it is case 883, July of 2010. I wonder what happened in that month? Regardless, I chose to replace it with the 1st-quartile value, since it sits at the low end of the series’ typical range, which I believe helps preserve the seasonality.
df2[which.min(df2$KWH),]
## # A tibble: 1 × 3
## CaseSequence Time KWH
## <dbl> <mth> <dbl>
## 1 883 2010 Jul 770523
## replace this value with the 1st quartile
df2[151,3] <- 5434539
Now we convert the data frame into a tsibble with Time as the index.
clean_df2 <- df2 %>%
as_tsibble(index = Time)
Now the data frame looks good.
Let’s attempt to build some models; first, let’s take another look at what we are working with.
clean_df2 %>%
autoplot(KWH) + labs(title = "KWH Monthly Values")
From the time series plot of KWH, the data shows strong seasonality with a slight positive trend; we can also see the imputed 1st-quartile observation around 2010.
Let’s transform our data with a Box-Cox transformation; perhaps it can stabilize that sharp dip around 2010.
## Setting the key to CaseSequence in as_tsibble() throws an error later on, so Time was simply used as the index above
KWHlambda <- clean_df2 |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
clean_df2 %>%
autoplot(box_cox(KWH,KWHlambda)) + labs(title = "KWH Box-Cox Transformation")
The transformation doesn’t appear to have changed the variance much at all.
Let’s build some models; I will also build some without the Box-Cox transformation.
## Model building
df2_fc <- clean_df2 %>%
model(`seasonal_naivet` = SNAIVE(box_cox(KWH,KWHlambda)),
`seasonal_naive` = SNAIVE(KWH),
additivet = ETS(box_cox(KWH,KWHlambda) ~ error("A") + trend("A") + season("A")),
additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
multiplicativet = ETS(box_cox(KWH,KWHlambda) ~ error("M") + trend("N") + season("M")),
multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M")),
stepwiset = ARIMA(box_cox(KWH,KWHlambda)),
stepwise = ARIMA(KWH))
It’s not shown here, but I could not fit ETS models with damped trends, so they were removed.
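For reference, the damped-trend specification that would not estimate looks like the following in fable’s syntax (trend("Ad") denotes a damped additive trend); it is shown commented out since it failed here:
## Damped additive trend on the transformed series (failed to estimate here):
## ETS(box_cox(KWH, KWHlambda) ~ error("A") + trend("Ad") + season("A"))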
glance(df2_fc) %>%
arrange(AICc) %>%
select(.model:BIC)
## # A tibble: 8 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stepwiset 8.80e- 6 804. -1598. -1597. -1582.
## 2 additivet 7.81e- 6 633. -1231. -1228. -1176.
## 3 multiplicativet 4.42e- 7 629. -1227. -1224. -1178.
## 4 stepwise 4.45e+11 -2672. 5354. 5354. 5370.
## 5 multiplicative 1.00e- 2 -3063. 6160. 6164. 6215.
## 6 additive 4.74e+11 -3077. 6188. 6192. 6244.
## 7 seasonal_naivet 1.33e- 5 NA NA NA NA
## 8 seasonal_naive 7.90e+11 NA NA NA NA
accuracy(df2_fc) %>%
arrange(RMSE)
## # A tibble: 8 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stepwise Traini… -7864. 6.38e5 4.60e5 -0.898 7.00 0.710 0.716 -0.00659
## 2 additivet Traini… 38791. 6.54e5 4.68e5 -0.336 7.11 0.723 0.734 0.286
## 3 multiplicative Traini… 32164. 6.57e5 4.77e5 -0.378 7.23 0.735 0.737 0.313
## 4 additive Traini… 11959. 6.59e5 4.77e5 -0.751 7.26 0.736 0.739 0.278
## 5 multiplicativet Traini… 83776. 6.74e5 4.98e5 0.451 7.53 0.769 0.756 0.221
## 6 stepwiset Traini… 65135. 6.83e5 5.18e5 0.325 7.87 0.799 0.767 0.128
## 7 seasonal_naivet Traini… 94245. 8.91e5 6.48e5 0.573 9.64 1 1 0.300
## 8 seasonal_naive Traini… 94245. 8.91e5 6.48e5 0.573 9.64 1 1 0.300
I feel a bit conflicted about which model to choose: the transformed stepwise ARIMA performed better on AICc and BIC, but the ordinary stepwise ARIMA performed better on RMSE. Since the textbook tells us to choose the model with the lowest AICc, I will go with the transformed stepwise ARIMA (keeping in mind that AICc values are only strictly comparable between models estimated on the same scale).
df2_fc %>%
select(.model = "stepwiset") %>%
report()
## Series: KWH
## Model: ARIMA(1,0,0)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, KWHlambda)
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.2454 -0.7524 -0.3888 7e-04
## s.e. 0.0745 0.0724 0.0730 2e-04
##
## sigma^2 estimated as 8.799e-06: log likelihood=803.92
## AIC=-1597.84 AICc=-1597.49 BIC=-1581.87
We have an ARIMA(1,0,0)(2,1,0)[12] model with drift, fitted to the Box-Cox transformed data.
Let’s forecast the values with this stepwise model; we will forecast the monthly values for 2014 and 2015.
df2_fcc <- df2_fc %>%
forecast(h = "24 months")%>%
filter(.model == "stepwiset")
## Let's plot the KWH forecast
df2_fcc %>%
autoplot(clean_df2) + labs(title = "Forecasts with ARIMA Transformed for 2014")
## Let's check the diagnostics
df2_fc %>%
select(stepwiset) %>%
gg_tsresiduals()
The residuals appear consistent with white noise, since there are only two marginally significant lags, and they look roughly normally distributed.
# Ljung-Box test: lag = 2*m = 24 since m = 12 for monthly data
## Tweaking the dof made the test significant, so the dof parameter is left at its default
df2_fc %>%
select(.model = "stepwiset") %>%
augment() %>%
features(.innov, ljung_box, lag = 24)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 .model 34.7 0.0729
The test is not significant at the 5% level, though only marginally (p ≈ 0.07).
## Put the forecasts in the same shape as the original file
df2_final_val <-
df2_fcc %>%
as.data.frame() %>%
mutate(Casesequence = 925:948) %>%
select(Casesequence,.mean,Time) %>%
rename(KWH = .mean)
library(writexl)
## Save it to an Excel file
write_xlsx(df2_final_val,'C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\KWH_forecast2014.xlsx')
For Part B we cleaned the KWH data frame by renaming the date column to Time and converting it to a proper yearmonth format, with help from section 2.1 of the Hyndman textbook. We imputed the missing value with the median, since the outlier was skewing the mean, and replaced the outlier itself with the 1st-quartile value, which I believed would help preserve the seasonality of the data. For modeling, I tried a variety of models, from simple benchmarks such as SNAIVE to ETS and ARIMA, on both transformed and untransformed data, and ultimately chose the lowest-AICc model: the transformed stepwise ARIMA, an ARIMA(1,0,0)(2,1,0)[12] with drift. I struggled with the portmanteau test in this part, since tweaking the dof made the result significant, so I left dof at its default, under which the residuals were not significant, though only marginally. As in Part A, the forecasts did not require manual back-transformation, since fable back-transforms automatically when the transformation is specified in the model formula.