library(fpp3)
## Warning: package 'tsibble' was built under R version 4.3.2
library(forecast)
library(imputeTS)
This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday Apr 11. I will accept late submissions with a penalty until the meetup after that when we review some projects.
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file.
The variable Cash is provided in hundreds of dollars,
other than that it is straight forward.
I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast.
I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file
Also please submit the forecast which you will put in an Excel readable file.
ATM <- readxl::read_excel("ATM624Data.xlsx",col_types = c("date","text", "numeric"))
head(ATM)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dttm> <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1 96
## 2 2009-05-01 00:00:00 ATM2 107
## 3 2009-05-02 00:00:00 ATM1 82
## 4 2009-05-02 00:00:00 ATM2 89
## 5 2009-05-03 00:00:00 ATM1 85
## 6 2009-05-03 00:00:00 ATM2 90
summary(ATM)
## DATE ATM Cash
## Min. :2009-05-01 00:00:00.00 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 00:00:00.00 Class :character 1st Qu.: 0.5
## Median :2009-11-01 00:00:00.00 Mode :character Median : 73.0
## Mean :2009-10-31 19:11:48.27 Mean : 155.6
## 3rd Qu.:2010-02-01 00:00:00.00 3rd Qu.: 114.0
## Max. :2010-05-14 00:00:00.00 Max. :10919.8
## NA's :19
ATM$DATE = as.Date(ATM$DATE)
There are 19 NA cash values in this dataset. I subset the dataset further to inspect the NA rows and determine if it needs to be imputed or removed from the data set.
ATM[is.na(ATM$Cash), ]
## # A tibble: 19 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-18 ATM2 NA
## 4 2009-06-22 ATM1 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
For the ATMs that are missing Cash values on the given days, it is possible that there was no cash recieved. There are so few missing values that it could be removed from the dataset.
ATM[is.na(ATM$ATM), ]
## # A tibble: 14 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-05-01 <NA> NA
## 2 2010-05-02 <NA> NA
## 3 2010-05-03 <NA> NA
## 4 2010-05-04 <NA> NA
## 5 2010-05-05 <NA> NA
## 6 2010-05-06 <NA> NA
## 7 2010-05-07 <NA> NA
## 8 2010-05-08 <NA> NA
## 9 2010-05-09 <NA> NA
## 10 2010-05-10 <NA> NA
## 11 2010-05-11 <NA> NA
## 12 2010-05-12 <NA> NA
## 13 2010-05-13 <NA> NA
## 14 2010-05-14 <NA> NA
The ATMs that are NA could be due to data error. It would be best to remove them as they do not give any important information necessary for imputing.
The data is a tibble object and needs to be converted to a
tsibble object in order to perform the time series
analysis.
ATM_ts <- ATM |>
mutate(DATE = ymd(DATE)) |>
as_tsibble(key = c(ATM),
index = DATE)
The time series plot shows the four different ATMs available in the dataset. The fourth ATM has a large increase in cash between January and April of 2010. The other ATMs all have relativly low Cash values.
ATM_ts |>
autoplot(Cash)
## Warning: Removed 14 rows containing missing values (`geom_line()`).
A facet wrap allows us to see the timeseries clearer. There also appears
to be NA vale
ATM_ts |>
autoplot(Cash) +
facet_wrap(~ATM, ncol=1)
## Warning: Removed 14 rows containing missing values (`geom_line()`).
Taking a closer look at ATM1 and its time series plot, there is not obvious increase or decrease in the trend of the time series. There does appear to be a seasonlaity and the variation looks relatively stable.
ATM_ts |>
filter(ATM == "ATM1") |>
autoplot(Cash)
Next, in order to apply models to this dataset, I filtered for ATM1
and made it its own time series object called ATM1.
ATM1<- ATM_ts |>
filter(ATM == "ATM1") |>
update_tsibble(key = c()) |>
select(-(ATM))
ATM1 |>
head()
## # A tsibble: 6 x 2 [1D]
## DATE Cash
## <date> <dbl>
## 1 2009-05-01 96
## 2 2009-05-02 82
## 3 2009-05-03 85
## 4 2009-05-04 90
## 5 2009-05-05 99
## 6 2009-05-06 88
By using filter index I am able to subset the time series to data before May 2010.
ATM1_train <- ATM_ts |>
filter(ATM == "ATM1") |>
filter_index("2009-05-01" ~ "2010-04-30") |>
select(Cash)
NA values can impact forecasting, especially with forecast models. There are three NA values in the Cash column of the ATM1 time series.
sum(is.na(ATM1_train$Cash ) )
## [1] 3
I will use the median of the Cash variable to replace the missing values.
As a basic forecast, I applied SNAIVE,
Mean, Naive and Drift models on
the data.
ATM1_train$Cash[is.na(ATM1_train$Cash)] <- median(ATM1_train$Cash, na.rm = T)
ATM1_fit <- ATM1_train |>
model(
Snaive = SNAIVE(Cash),
Mean = MEAN(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()))
ATM1_fc <- ATM1_fit |>
forecast(h=31)
ATM1_fc |>
autoplot(ATM1_train) +
autolayer(
filter_index(ATM1, "2010-05-01" ~ .),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = Cash`
From all four basic models, the best model by RMSE is SNAIVE model.
ATM1_fit |>
accuracy()
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Snaive Training -3.63e- 2 27.8 17.7 -96.5 116. 1 1 0.159
## 2 Mean Training -1.56e-15 36.5 27.1 -175. 198. 1.53 1.31 0.0595
## 3 Naïve Training -3.02e- 2 50.1 37.5 -132. 167. 2.12 1.80 -0.365
## 4 Drift Training -3.84e-17 50.1 37.5 -132. 167. 2.12 1.80 -0.365
Next, I compared the models above to an ETS model. Using
the ets function, it outputs the optimal ETS parameters and
selects the best AICc.
ets(ATM1_train$Cash)
## ETS(M,N,N)
##
## Call:
## ets(y = ATM1_train$Cash)
##
## Smoothing parameters:
## alpha = 0.0322
##
## Initial states:
## l = 79.126
##
## sigma: 0.4351
##
## AIC AICc BIC
## 4783.131 4783.198 4794.831
ets(ATM1_train$Cash) |>
accuracy()
## ME RMSE MAE MPE MAPE MASE
## Training set -0.07129336 36.63921 27.3546 -178.5067 202.2401 0.7286553
## ACF1
## Training set 0.03799582
There is no clear linear or cyclic trend in the data overall.
lambda <-ATM1_train |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
fit <- ATM1_train |>
model(
arima = ARIMA(box_cox(Cash, lambda)))
report(fit) |>
accuracy()
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.0994 -0.1141 -0.6456
## s.e. 0.0524 0.0527 0.0426
##
## sigma^2 estimated as 1.576: log likelihood=-589.77
## AIC=1187.54 AICc=1187.66 BIC=1203.07
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Training 2.34 25.0 16.0 -87.9 107. 0.906 0.900 0.0116
fit |> select(arima) %>% gg_tsresiduals()
The residuals all fall within the boundary in the ACF plot and the
residuals show a better distribution (closer to normal) than the
ets function.
An Arima with a boxcox transformation had the lowest error compared to the other function. I will be using this to forecast results for the submission of this project.
ATM1_fc <- fit |>
forecast(h=31)
data.frame(Date = ATM1_fc$DATE, Cash = ATM1_fc$.mean) |>
write.csv("Project1_PartB_ATM1.csv")
For ATM2 the periods are much shorter in this time series compared to the ATM1.
ATM2<- ATM_ts |>
filter(ATM == "ATM2") |>
update_tsibble(key = c()) |>
select(-(ATM))
ATM2_train <- ATM_ts |>
filter(ATM == "ATM2") |>
filter_index("2009-05-01" ~ "2010-04-30") |>
select(Cash)
ATM2_fit <- ATM2_train |>
model(MEAN(Cash))
ATM2_fc <- ATM2_fit |>
forecast(h=31)
ATM2_fc |>
autoplot(ATM2_train) +
autolayer(
filter_index(ATM2, "2010-05-01" ~ .),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = Cash`
sum(is.na(ATM2$Cash ) )
## [1] 2
There are two missing values in the Cash variable in ATM2.
ATM2[!complete.cases(ATM2), ]
## # A tsibble: 2 x 2 [1D]
## DATE Cash
## <date> <dbl>
## 1 2009-06-18 NA
## 2 2009-06-24 NA
ATM2_train <- ATM_ts |>
filter(ATM == "ATM2") |>
filter_index("2009-05-01" ~ "2010-04-30") |>
select(Cash)
ATM2_train$Cash[is.na(ATM2_train$Cash)] <- median(ATM2_train$Cash, na.rm = T)
ATM2_fit <- ATM2_train |>
model(
Snaive = SNAIVE(Cash),
Mean = MEAN(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()))
ATM2_fc <- ATM2_fit |>
forecast(h=31)
ATM2_fc |>
autoplot(ATM2_train) +
autolayer(
filter_index(ATM2, "2010-05-01" ~ .),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = Cash`
ATM2_fit |>
accuracy()
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Snaive Training 2.23e- 2 30.1 20.8 -Inf Inf 1 1 0.0447
## 2 Mean Training 2.92e-15 38.7 33.1 -Inf Inf 1.59 1.29 0.0215
## 3 Naïve Training -4.67e- 2 54.2 42.5 -Inf Inf 2.04 1.80 -0.309
## 4 Drift Training 2.15e-15 54.2 42.5 -Inf Inf 2.04 1.80 -0.309
ATM2_train |>
features(Cash, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
ATM2_train |>
gg_tsdisplay(difference(Cash, 30),
plot_type='partial', lag=36) +
labs(title="Seasonally differenced", y="")
## Warning: Removed 30 rows containing missing values (`geom_line()`).
## Warning: Removed 30 rows containing missing values (`geom_point()`).
The significant lag at 1 and a differencing of 1.
fit <- ATM2_train |>
model(
arima012011 = ARIMA(Cash ~ pdq(0,1,2) + PDQ(0,1,1)),
arima210011 = ARIMA(Cash ~ pdq(2,1,0) + PDQ(0,1,1)),
auto = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
)
fit |> pivot_longer(everything(), names_to = "Model name",
values_to = "Orders")
## # A mable: 3 x 2
## # Key: Model name [3]
## `Model name` Orders
## <chr> <model>
## 1 arima012011 <ARIMA(0,1,2)(0,1,1)[7]>
## 2 arima210011 <ARIMA(2,1,0)(0,1,1)[7]>
## 3 auto <ARIMA(2,0,2)(0,1,1)[7]>
glance(fit) |> arrange(AICc) |> select(.model:BIC)
## # A tibble: 3 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 auto 602. -1653. 3319. 3319. 3342.
## 2 arima012011 639. -1664. 3336. 3336. 3351.
## 3 arima210011 869. -1715. 3438. 3439. 3454.
fit |> select(auto) |> gg_tsresiduals(lag=36)
forecast(fit, h=31) |>
filter(.model=='auto') |>
autoplot(ATM2_train) +
labs(title = "ATM4 Withdrawals",
y="Cash (hundreds)")
fit |>
accuracy()
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima012011 Training -0.434 24.9 17.3 -Inf Inf 0.829 0.826 -0.00407
## 2 arima210011 Training 0.0155 29.0 20.2 NaN Inf 0.970 0.963 -0.129
## 3 auto Training -0.885 24.1 17.0 -Inf Inf 0.819 0.801 -0.00512
lambda <-ATM2_train |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
fit <- ATM2_train |>
model(
arima = ARIMA(box_cox(Cash, lambda)))
report(fit) |>
accuracy()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4247 -0.9074 0.468 0.8004 -0.7262
## s.e. 0.0587 0.0438 0.089 0.0584 0.0409
##
## sigma^2 estimated as 69.93: log likelihood=-1267.92
## AIC=2547.85 AICc=2548.09 BIC=2571.13
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Training 0.243 24.4 17.1 -Inf Inf 0.823 0.811 -0.0123
ATM2_fc <- fit |>
forecast(h=31)
ATM2_fc |>
autoplot(ATM2_train) +
autolayer(
filter_index(ATM2, "2010-05-01" ~ .),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = Cash`
ATM2_train |> gg_tsdisplay(difference(box_cox(Cash,lambda), 30),
plot_type='partial', lag_max = 36)
## Warning: Removed 30 rows containing missing values (`geom_line()`).
## Warning: Removed 30 rows containing missing values (`geom_point()`).
fit <- ATM2_train |>
model(
arima_box_cox = ARIMA(box_cox(Cash,lambda) ~ pdq(2,0,2) + PDQ(0,1,1)),
)
fit |> pivot_longer(everything(), names_to = "Model name",
values_to = "Orders")
## # A mable: 1 x 2
## # Key: Model name [1]
## `Model name` Orders
## <chr> <model>
## 1 arima_box_cox <ARIMA(2,0,2)(0,1,1)[7]>
glance(fit) |> arrange(AICc) |> select(.model:BIC)
## # A tibble: 1 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima_box_cox 69.9 -1268. 2548. 2548. 2571.
fit |> select(arima_box_cox) |> gg_tsresiduals(lag=36)
report(fit) |>
accuracy()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4247 -0.9074 0.468 0.8004 -0.7262
## s.e. 0.0587 0.0438 0.089 0.0584 0.0409
##
## sigma^2 estimated as 69.93: log likelihood=-1267.92
## AIC=2547.85 AICc=2548.09 BIC=2571.13
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima_box_cox Training 0.243 24.4 17.1 -Inf Inf 0.823 0.811 -0.0123
fit <- ATM2_train |>
model(ARIMA(Cash, stepwise = FALSE, approx = FALSE)
)
ATM2_csv <- forecast(fit, h=31)
data.frame(`Date` = ATM2_csv$DATE, Cash = ATM2_csv$.mean) |>
write.csv("Project1_PartA_ATM2.csv")
ATM_ts |>
filter(ATM == "ATM3") |>
head()
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM3 0
## 2 2009-05-02 ATM3 0
## 3 2009-05-03 ATM3 0
## 4 2009-05-04 ATM3 0
## 5 2009-05-05 ATM3 0
## 6 2009-05-06 ATM3 0
ATM3 has 365 values and of all of these values only 3 have Cash values. Every other day had 0 Cash withdrawn. Because of the lack of variation in the cash withdrawals, this time series would not provide accurate forecasts.
sum(is.na(ATM2$Cash ) )
## [1] 2
ATM3 <- ATM_ts |>
filter(ATM == "ATM3") |>
update_tsibble(key = c()) |>
select(-(ATM))
ATM3_train <- ATM_ts |>
filter(ATM == "ATM3") |>
filter_index("2009-05-01" ~ "2010-04-30") |>
select(Cash)
ATM3_fit <- ATM3_train |>
model(MEAN(Cash))
ATM3_fc <- ATM3_fit |>
forecast(h=31)
ATM3_fc |>
autoplot(ATM3_train) +
autolayer(
filter_index(ATM3, "2010-05-01" ~ .),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = Cash`
ATM4 <- ATM_ts |>
filter(ATM == "ATM4") |>
update_tsibble(key = c()) |>
select(-(ATM))
ATM4 |>
head()
## # A tsibble: 6 x 2 [1D]
## DATE Cash
## <date> <dbl>
## 1 2009-05-01 777.
## 2 2009-05-02 524.
## 3 2009-05-03 793.
## 4 2009-05-04 908.
## 5 2009-05-05 52.8
## 6 2009-05-06 52.2
sum(is.na(ATM4$Cash ) )
## [1] 0
The fourth ATM has no NA values.
ATM4_train <- ATM_ts |>
filter(ATM == "ATM4") |>
filter_index("2009-05-01" ~ "2010-04-30") |>
select(Cash)
ATM4_fit <- ATM4_train |>
model(MEAN(Cash))
ATM4_fc <- ATM4_fit |>
forecast(h=31)
ATM4_fc |>
autoplot(ATM4_train) +
autolayer(
filter_index(ATM4, "2010-05-01" ~ .),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = Cash`
The Time series does have a significant outlier in the amount of cash
withdrawn between January and April 2010. This outlier will affect the
forecasts so to address this, I will impute the outliers.
By setting the robust parameter to true, STL decomposition can be more robust to outliers.
dcmp <- ATM4_train |>
model(
STL(Cash ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) |>
components() |>
select(-.model)
dcmp |>
model(NAIVE(season_adjust)) |>
forecast() |>
autoplot(dcmp) +
labs(y = "Number of people",
title = "US retail employment")
There is no obvious trend, but there is an obvious seasonality.
fit_dcmp <- ATM4_train |>
model(stlf = decomposition_model(
STL(Cash ~ trend(window = 7), robust = TRUE),
NAIVE(season_adjust)
))
fit_dcmp |>
forecast() |>
autoplot(ATM4_train)+
labs(y = "Cash (hundreds)",
title = "ATM4 Withdrawal")
fit_dcmp |> gg_tsresiduals()
## Warning: Removed 7 rows containing missing values (`geom_line()`).
## Warning: Removed 7 rows containing missing values (`geom_point()`).
## Warning: Removed 7 rows containing non-finite values (`stat_bin()`).
fit_dcmp |>
accuracy()
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stlf Training -0.490 907. 375. -320. 453. 0.934 1.01 -0.497
impute using the tsimpute function
ATM4_train2 <- imputeTS::na_interpolation(ATM4_train)
ATM4_fit2 <- ATM4_train2 |>
model(
Snaive = SNAIVE(Cash),
Mean = MEAN(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()),
auto = ARIMA(Cash))
ATM4_fc2 <- ATM4_fit2 |>
forecast(h=31)
ATM4_fc2 |>
autoplot(ATM4_train2) +
autolayer(
filter_index(ATM4, "2010-05-01" ~ .),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = Cash`
ATM4_fit2 |>
accuracy()
## # A tibble: 5 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Snaive Training -3.70e+ 0 896. 402. -395. 450. 1 1 -0.0151
## 2 Mean Training 1.99e-14 650. 324. -617. 647. 0.805 0.725 -0.00936
## 3 Naïve Training -8.10e- 1 925. 444. -555. 615. 1.10 1.03 -0.488
## 4 Drift Training -1.20e-15 925. 444. -554. 614. 1.10 1.03 -0.488
## 5 auto Training -1.51e-10 650. 324. -617. 647. 0.805 0.725 -0.00936
The mean and the auto arima function give the best forecast in terms of RMSE for the imputation. I will use the auto arima function for the forecast.
data.frame(`Date` = ATM4_fc2$DATE, Cash = ATM4_fc2$.mean) |>
write.csv("Project1_PartA_ATM4.csv")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward.
Add this to your existing files above.
residential_power <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx")
summary(residential_power)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
head(residential_power)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
tail(residential_power)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 919 2013-Jul 8415321
## 2 920 2013-Aug 9080226
## 3 921 2013-Sep 7968220
## 4 922 2013-Oct 5759367
## 5 923 2013-Nov 5769083
## 6 924 2013-Dec 9606304
rp_ts <- residential_power |>
select(`YYYY-MMM`, KWH) |>
mutate(YearMonth = yearmonth(`YYYY-MMM`)) |>
as_tsibble(index = YearMonth) |>
select(-`YYYY-MMM`)
head(rp_ts)
## # A tsibble: 6 x 2 [1M]
## KWH YearMonth
## <dbl> <mth>
## 1 6862583 1998 Jan
## 2 5838198 1998 Feb
## 3 5420658 1998 Mar
## 4 5010364 1998 Apr
## 5 4665377 1998 May
## 6 6467147 1998 Jun
rp_train <- rp_ts |>
filter_index("1998 Jan" ~ "2013 Oct")
rp_train |>
head()
## # A tsibble: 6 x 2 [1M]
## KWH YearMonth
## <dbl> <mth>
## 1 6862583 1998 Jan
## 2 5838198 1998 Feb
## 3 5420658 1998 Mar
## 4 5010364 1998 Apr
## 5 4665377 1998 May
## 6 6467147 1998 Jun
According to the summary, there is only one NA value in the dataset. According to the table below, the missing value is for KWH.
rp_train[is.na(rp_train$KWH), ]
## # A tsibble: 1 x 2 [1M]
## KWH YearMonth
## <dbl> <mth>
## 1 NA 2008 Sep
rp_train |>
ggplot(aes(KWH)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
Based on this graph, and the outlier at the beginning, I will use the median to impute the data.
rp_train <- na_interpolation(rp_train)
rp_train |>
head()
## # A tsibble: 6 x 2 [1M]
## KWH YearMonth
## <dbl> <mth>
## 1 6862583 1998 Jan
## 2 5838198 1998 Feb
## 3 5420658 1998 Mar
## 4 5010364 1998 Apr
## 5 4665377 1998 May
## 6 6467147 1998 Jun
rp_train |>
autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`
There is an upward trend in the data and a strong seasonality. There
also is an outlier after January 2010 that may impact the data.
dcmp <- rp_train |>
model(stl = STL(KWH ))
components(dcmp) |>
as_tsibble() |>
autoplot(KWH, colour="gray") +
geom_line(aes(y=trend), colour = "#D55E00") +
labs(
y = "KWH",
title = ""
)
components(dcmp) |> autoplot()
rp_fit <- rp_train |>
model(
Snaive = SNAIVE(KWH),
Mean = MEAN(KWH),
`Naïve` = NAIVE(KWH),
Drift = NAIVE(KWH ~ drift()))
rp_fc <- rp_fit |>
forecast(h=12)
rp_fc |>
autoplot(rp_train) +
autolayer(
filter_index(rp_ts, "2013 Dec" ~ .),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = KWH`
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
rp_fit |>
accuracy()
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Snaive Training 7.83e+ 4 1163215. 687493. -4.16 14.6 1 1 0.240
## 2 Mean Training -4.12e-10 1428917. 1175505. -7.79 22.0 1.71 1.23 0.435
## 3 Naïve Training -5.84e+ 3 1522098. 1188389. -5.94 21.9 1.73 1.31 0.0279
## 4 Drift Training -4.24e-10 1522087. 1187988. -5.84 21.9 1.73 1.31 0.0279
Of the four models, the best model based on the RMSE is the SNAIVE model.
rp_train |>
gg_tsdisplay(KWH,
plot_type='partial', lag=36) +
labs(title="Seasonally differenced", y="")
rp_train |>
features(KWH, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
rp_train |>
gg_tsdisplay(difference(KWH,12),
plot_type='partial', lag=36) +
labs(title="Seasonally differenced", y="")
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
fit <- rp_train |>
model(
arima012011 = ARIMA(KWH ~ pdq(0,1,1) + PDQ(0,1,1)),
arima210011 = ARIMA(KWH ~ pdq(1,1,0) + PDQ(0,1,1)),
auto = ARIMA(KWH, stepwise = FALSE, approx = FALSE)
)
fit |> pivot_longer(everything(), names_to = "Model name",
values_to = "Orders")
## # A mable: 3 x 2
## # Key: Model name [3]
## `Model name` Orders
## <chr> <model>
## 1 arima012011 <ARIMA(0,1,1)(0,1,1)[12]>
## 2 arima210011 <ARIMA(1,1,0)(0,1,1)[12]>
## 3 auto <ARIMA(1,0,0)(0,1,1)[12] w/ drift>
glance(fit) |> arrange(AICc) |> select(.model:BIC)
## # A tibble: 3 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima012011 765139451910. -2680. 5365. 5366. 5375.
## 2 auto 707997072666. -2687. 5381. 5382. 5394.
## 3 arima210011 930636243374. -2698. 5401. 5401. 5411.
lambda <- rp_train |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
fit <- rp_train |>
model(
arima = ARIMA(box_cox(KWH, lambda)))
report(fit) |>
accuracy()
## Series: KWH
## Model: ARIMA(0,0,1)(0,0,2)[12] w/ mean
## Transformation: box_cox(KWH, lambda)
##
## Coefficients:
## ma1 sma1 sma2 constant
## 0.1378 0.1367 0.1227 2.4179
## s.e. 0.0726 0.0729 0.0662 0.0001
##
## sigma^2 estimated as 2.321e-07: log likelihood=1183.38
## AIC=-2356.77 AICc=-2356.44 BIC=-2340.53
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Training 259048. 1294645. 965398. -2.93 17.9 1.40 1.11 0.297
From the above analysis, I will be using the SNAIVE model to predict.
rp_fit <- rp_train |>
model(
Snaive = SNAIVE(KWH))
rp_fc <- rp_fit |>
forecast(h=14)
rp_fc |>
autoplot(rp_train) +
autolayer(
filter_index(rp_ts, "2013 Dec" ~ .),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = KWH`
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
data.frame(`YYYY-MMM` = rp_fc$YearMonth, KWH = rp_fc$.mean) |>
write.csv("Project1_PartB.csv")
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.