library(readxl)
library(fpp3)
## -- Attaching packages ------------------------------------------------------------------------------------------------------- fpp3 0.4.0 --
## v tibble 3.1.1 v tsibble 1.1.1
## v dplyr 1.0.6 v tsibbledata 0.4.0
## v tidyr 1.1.3 v feasts 0.2.2
## v lubridate 1.7.4 v fable 0.3.0
## v ggplot2 3.3.5
## -- Conflicts ------------------------------------------------------------------------------------------------------------ fpp3_conflicts --
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag() masks stats::lag()
## x tsibble::new_interval() masks lubridate::new_interval()
## x tsibble::setdiff() masks base::setdiff()
## x tsibble::union() masks base::union()
library(DataExplorer)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v readr 1.3.1 v stringr 1.4.0
## v purrr 0.3.2 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x tsibble::intersect() masks lubridate::intersect(), base::intersect()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag() masks stats::lag()
## x tsibble::new_interval() masks lubridate::new_interval()
## x tsibble::setdiff() masks lubridate::setdiff(), base::setdiff()
## x tsibble::union() masks lubridate::union(), base::union()
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Part A – ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of four different ATMs for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Starting with Part A, we will first load the ATM data, which I have saved in a local folder, and explore it.
data <- read_excel("ATM624Data.xlsx", col_names = TRUE, col_types = c('date', 'guess', 'guess'))
The data contain the variables “DATE”, “ATM” and “Cash”.
head(data)
## # A tibble: 6 x 3
## DATE ATM Cash
## <dttm> <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1 96
## 2 2009-05-01 00:00:00 ATM2 107
## 3 2009-05-02 00:00:00 ATM1 82
## 4 2009-05-02 00:00:00 ATM2 89
## 5 2009-05-03 00:00:00 ATM1 85
## 6 2009-05-03 00:00:00 ATM2 90
As we can observe in the table above, the variable “DATE” is of data type “POSIXct”. We will have to convert it to the Date class in order to work with the data.
summary(data)
## DATE ATM Cash
## Min. :2009-05-01 00:00:00 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 00:00:00 Class :character 1st Qu.: 0.5
## Median :2009-11-01 00:00:00 Mode :character Median : 73.0
## Mean :2009-10-31 19:11:48 Mean : 155.6
## 3rd Qu.:2010-02-01 00:00:00 3rd Qu.: 114.0
## Max. :2010-05-14 00:00:00 Max. :10919.8
## NA's :19
We can also see that the maximum value of the “Cash” variable is 10919.8. Since the numbers are in hundreds of dollars, multiplying this by 100 gives $1,091,980, and it is quite unlikely that any ATM would be able to dispense that much cash. This is most likely an error, and we will address it accordingly.
plot_missing(data)
There are a few missing values for “ATM”, but I believe these were left blank on purpose so that the predictions for May 2010 can be entered later. Additionally, there are a few missing values in “Cash” that we may need to address.
We will create a separate data set in order to preserve the original data and make all the necessary transformations there. First we’ll convert the “DATE” variable to its corresponding Date type. We will also make “Cash” an integer, as we know that ATMs do not dispense cents.
atm <- data %>%
mutate(DATE = as_date(DATE), Cash = as.integer(Cash))
str(atm)
## tibble [1,474 x 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: int [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
Since we are looking at 4 separate ATMs and the data do not provide their locations, they may have varying amounts of cash withdrawn on different days. We will work with each ATM separately, and each will have its own forecast for May 2010.
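As a side note, tsibble can also hold all four series in a single object keyed by ATM; the sketch below is illustrative only, since we prefer to clean each ATM individually.
# Illustrative sketch: one tsibble keyed by ATM would let autoplot()
# draw all four series at once; rows with a missing ATM label are dropped.
atm_all <- atm %>%
  filter(!is.na(ATM)) %>%
  as_tsibble(index = DATE, key = ATM)
autoplot(atm_all, Cash) + facet_wrap(~ATM, scales = "free_y")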
atm1 <- atm %>%
filter(ATM == "ATM1") %>%
as_tsibble(index = DATE)
autoplot(atm1) +
labs(title="ATM1", subtitle="Cash withdrawals per day", y="Hundreds USD")
## Plot variable not specified, automatically selected `.vars = Cash`
atm2 <- atm %>%
filter(ATM == "ATM2") %>%
as_tsibble(index = DATE)
autoplot(atm2) +
labs(title="ATM2", subtitle="Cash withdrawals per day", y="Hundreds USD")
## Plot variable not specified, automatically selected `.vars = Cash`
atm3 <- atm %>%
filter(ATM == "ATM3") %>%
as_tsibble(index = DATE)
autoplot(atm3) +
labs(title="ATM3", subtitle="Cash withdrawals per day", y="Hundreds USD")
## Plot variable not specified, automatically selected `.vars = Cash`
atm4 <- atm %>%
filter(ATM == "ATM4") %>%
as_tsibble(index = DATE)
autoplot(atm4) +
labs(title="ATM4", subtitle="Cash withdrawals per day", y="Hundreds USD")
## Plot variable not specified, automatically selected `.vars = Cash`
Based on the plots above, we can observe that “ATM1” and “ATM2” are time series with fairly constant variability, no apparent trend, and potential seasonality, which we will explore in more detail with decompositions. “ATM3”, however, seems to have withdrawals only for the last few days in the data. Lastly, “ATM4” has a clear outlier, which is the one we identified previously with the summary() function.
which(atm4$Cash == 10919)
## [1] 285
The outlier is found on row 285 of the fourth ATM’s time series. Since it is clearly an error, we will impute it with the series average. After imputation, the time series looks more like “ATM1” and “ATM2”:
atm4[285,3] <- round(mean(atm4$Cash),0)
autoplot(atm4) +
labs(title="ATM4", subtitle="Cash withdrawals per day", y="Hundreds USD")
## Plot variable not specified, automatically selected `.vars = Cash`
We also seem to have a few missing values on the “Cash” column for “ATM1” and “ATM2”:
sum(is.na(atm1$Cash))
## [1] 3
sum(is.na(atm2$Cash))
## [1] 2
sum(is.na(atm3$Cash))
## [1] 0
sum(is.na(atm4$Cash))
## [1] 0
The missing values are found on the rows below for each respective ATM:
which(is.na(atm1$Cash))
## [1] 44 47 53
which(is.na(atm2$Cash))
## [1] 49 55
hist(atm1$Cash)
hist(atm2$Cash)
Since I don’t see evident skewness in the distribution of “Cash” for either ATM above, I will use the averages to impute these values:
atm1[44,3] <- round(mean(atm1$Cash, na.rm=TRUE),0)
atm1[47,3] <- round(mean(atm1$Cash, na.rm=TRUE),0)
atm1[53,3] <- round(mean(atm1$Cash, na.rm=TRUE),0)
atm2[49,3] <- round(mean(atm2$Cash, na.rm=TRUE),0)
atm2[55,3] <- round(mean(atm2$Cash, na.rm=TRUE),0)
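Instead of the row-by-row assignments above, an equivalent and more concise alternative (a sketch; it should yield essentially the same values) is tidyr’s replace_na():
# Sketch: impute all NAs at once; as.integer() keeps the column type intact.
atm1 <- atm1 %>%
  mutate(Cash = replace_na(Cash, as.integer(round(mean(Cash, na.rm = TRUE)))))
atm2 <- atm2 %>%
  mutate(Cash = replace_na(Cash, as.integer(round(mean(Cash, na.rm = TRUE)))))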
Additionally, we can look for a Box-Cox transformation for each series to help make them a little simpler:
lambda1 <- atm1%>%
features(Cash,features = guerrero)%>%
pull(lambda_guerrero)
plot_trans1 <- atm1 %>%
autoplot(box_cox(Cash, lambda1)) +
labs(title="ATM1 TRANSFORMED", subtitle="Cash withdrawals per day", y="USD")
lambda2 <- atm2%>%
features(Cash,features = guerrero)%>%
pull(lambda_guerrero)
plot_trans2 <- atm2 %>%
autoplot(box_cox(Cash, lambda2)) +
labs(title="ATM2 TRANSFORMED", subtitle="Cash withdrawals per day", y="USD")
lambda3 <- atm3%>%
features(Cash,features = guerrero)%>%
pull(lambda_guerrero)
## Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
## max(.period, : NA/Inf replaced by maximum positive value
## Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
## max(.period, : NA/Inf replaced by maximum positive value
plot_trans3 <- atm3 %>%
autoplot(box_cox(Cash, lambda3)) +
labs(title="ATM3 TRANSFORMED", subtitle="Cash withdrawals per day", y="USD")
lambda4 <- atm4%>%
features(Cash,features = guerrero)%>%
pull(lambda_guerrero)
plot_trans4 <- atm4 %>%
autoplot(box_cox(Cash, lambda4)) +
labs(title="ATM4 TRANSFORMED", subtitle="Cash withdrawals per day", y="USD")
grid.arrange(plot_trans1, plot_trans2, plot_trans3, plot_trans4, nrow = 2)
The transformations helped scale down the time series of all ATMs.
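Since the four blocks above repeat the same steps, a small helper function (a hypothetical refactor; plot_bc is my own name and was not used to produce the results below) could compute the lambda and the transformed plot in one place:
# Hypothetical helper: Guerrero lambda plus transformed plot for one ATM.
plot_bc <- function(ts, name) {
  lambda <- ts %>% features(Cash, features = guerrero) %>% pull(lambda_guerrero)
  ts %>%
    autoplot(box_cox(Cash, lambda)) +
    labs(title = paste(name, "TRANSFORMED"), subtitle = "Cash withdrawals per day", y = "USD")
}
# Usage would mirror the grid above, e.g.:
# grid.arrange(plot_bc(atm1, "ATM1"), plot_bc(atm2, "ATM2"),
#              plot_bc(atm3, "ATM3"), plot_bc(atm4, "ATM4"), nrow = 2)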
Let’s now look at the decomposition of each series to see whether we have strong seasonality and whether differencing may be required in the model. Since the magnitude of the seasonal component does not seem to change with time, we can say the series are additive.
atm1%>%
model(classical_decomposition(box_cox(Cash, lambda1), type="additive")) %>%
components () %>%
autoplot() +
labs(title="Classical additive decomposition of ATM1")
## Warning: Removed 3 row(s) containing missing values (geom_path).
atm2%>%
model(classical_decomposition(box_cox(Cash, lambda2), type="additive")) %>%
components () %>%
autoplot() +
labs(title="Classical additive decomposition of ATM2")
## Warning: Removed 3 row(s) containing missing values (geom_path).
atm3%>%
model(classical_decomposition(box_cox(Cash, lambda3), type="additive")) %>%
components () %>%
autoplot() +
labs(title="Classical additive decomposition of ATM3")
## Warning: Removed 3 row(s) containing missing values (geom_path).
atm4%>%
model(classical_decomposition(box_cox(Cash, lambda4), type="additive")) %>%
components () %>%
autoplot() +
labs(title="Classical additive decomposition of ATM4")
## Warning: Removed 3 row(s) containing missing values (geom_path).
As is evident in the plots above, we see a strong seasonal component for all ATMs.
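A seasonal plot offers another view of this weekly pattern; the sketch below (shown for ATM1 only) uses feasts’ gg_season() with a weekly period:
atm1 %>%
  gg_season(Cash, period = "week") +
  labs(title = "ATM1 seasonal plot (weekly)", y = "Hundreds USD")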
plot_acf1 <- atm1 %>%
ACF(box_cox(Cash, lambda1)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM1")
plot_acf2 <- atm2 %>%
ACF(box_cox(Cash, lambda2)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM2")
plot_acf3 <- atm3 %>%
ACF(box_cox(Cash, lambda3)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM3")
plot_acf4 <- atm4 %>%
ACF(box_cox(Cash, lambda4)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM4")
grid.arrange(plot_acf1, plot_acf2, plot_acf3, plot_acf4, nrow = 2)
The ACF plots above show noticeable autocorrelation at the weekly lags, so we will apply unitroot_nsdiffs() to the daily cash withdrawals for each ATM to determine whether weekly seasonal differencing is needed.
atm1 %>%
features(box_cox(Cash, lambda1), unitroot_nsdiffs)
## # A tibble: 1 x 1
## nsdiffs
## <int>
## 1 1
atm2 %>%
features(box_cox(Cash, lambda2), unitroot_nsdiffs)
## # A tibble: 1 x 1
## nsdiffs
## <int>
## 1 1
atm3 %>%
features(box_cox(Cash, lambda3), unitroot_nsdiffs)
## # A tibble: 1 x 1
## nsdiffs
## <int>
## 1 0
atm4 %>%
features(box_cox(Cash, lambda4), unitroot_nsdiffs)
## # A tibble: 1 x 1
## nsdiffs
## <int>
## 1 0
As determined by the function above, we need to apply seasonal differencing to “ATM1” and “ATM2”. Let’s explore further to see whether any additional, non-seasonal differencing is needed:
atm1 %>%
features(difference(box_cox(Cash, lambda1), 7), unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 0
atm2 %>%
features(difference(box_cox(Cash, lambda2), 7), unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 0
atm3 %>%
features(box_cox(Cash, lambda3), unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 0
atm4 %>%
features(box_cox(Cash, lambda4), unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 0
No additional differencing seems to be needed. Let’s take a look at the ACF plots after differencing “ATM1” and “ATM2”.
atm1 %>%
ACF(difference(box_cox(Cash, lambda1), 7)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM1")
atm2 %>%
ACF(difference(box_cox(Cash, lambda2), 7)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM2")
Differencing seems to have made the data look closer to white noise.
Considering that these data need differencing, we will use the ARIMA() model, which applies differencing within the algorithm, making it simpler to build.
For “ATM3” we will use the naive model, which uses the last observation as the forecast. Given that there are only three non-zero observations, this is a sound approach.
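Before settling on the automatic selection, a few candidates could also be fit side by side and compared on in-sample accuracy; the sketch below is illustrative (the model names are mine) and is not how the final models were chosen:
# Sketch: compare the automatic ARIMA against simple benchmarks for ATM1.
atm1_candidates <- atm1 %>%
  model(
    arima  = ARIMA(box_cox(Cash, lambda1)),
    snaive = SNAIVE(box_cox(Cash, lambda1)),
    ets    = ETS(box_cox(Cash, lambda1))
  )
accuracy(atm1_candidates)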
atm1_fit <- atm1 %>%
model(ARIMA(box_cox(Cash, lambda1)))
report(atm1_fit)
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda1)
##
## Coefficients:
## ma1 ma2 sma1
## 0.1090 -0.1121 -0.6410
## s.e. 0.0524 0.0525 0.0433
##
## sigma^2 estimated as 3.052: log likelihood=-708.06
## AIC=1424.13 AICc=1424.24 BIC=1439.65
atm1_fit %>%
gg_tsresiduals()
atm2_fit <- atm2 %>%
model(ARIMA(box_cox(Cash, lambda2)))
report(atm2_fit)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda2)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4234 -0.9054 0.4669 0.7984 -0.7203
## s.e. 0.0594 0.0448 0.0895 0.0592 0.0414
##
## sigma^2 estimated as 48.4: log likelihood=-1201.99
## AIC=2415.98 AICc=2416.22 BIC=2439.26
atm2_fit %>%
gg_tsresiduals()
atm3_fit <- atm3 %>%
model(NAIVE(box_cox(Cash, lambda3)))
report(atm3_fit)
## Series: Cash
## Model: NAIVE
## Transformation: box_cox(Cash, lambda3)
##
## sigma^2: 1381.2177
atm3_fit %>%
gg_tsresiduals()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
atm4_fit <- atm4 %>%
model(ARIMA(box_cox(Cash, lambda4)))
report(atm4_fit)
## Series: Cash
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda4)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.0753 0.2146 0.2044 13.7047
## s.e. 0.0529 0.0516 0.0525 0.5685
##
## sigma^2 estimated as 107.4: log likelihood=-1369.93
## AIC=2749.86 AICc=2750.03 BIC=2769.36
atm4_fit %>%
gg_tsresiduals()
Except for “ATM3”, all residuals appear to have constant variability, resemble white noise, and have approximately normal distributions.
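To back the visual check with a formal test, a Ljung-Box test can be run on the innovation residuals; in this sketch for ATM1, lag = 14 covers two weekly cycles and dof = 3 matches the three estimated coefficients:
augment(atm1_fit) %>%
  features(.innov, ljung_box, lag = 14, dof = 3)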
We will take a look at the forecasts for each ATM:
plot_fc1 <- atm1_fit %>% forecast(h=31) %>%
autoplot(atm1)
plot_fc2 <- atm2_fit %>% forecast(h=31) %>%
autoplot(atm2)
plot_fc3 <- atm3_fit %>% forecast(h=31) %>%
autoplot(atm3)
plot_fc4 <- atm4_fit %>% forecast(h=31) %>%
autoplot(atm4)
grid.arrange(plot_fc1, plot_fc2, plot_fc3, plot_fc4, nrow = 2)
All forecasts seem reasonable from a visual perspective.
Finally, we save the forecast of each ATM in its own .csv document:
forecast_atm1 <- atm1_fit %>% forecast(h=31)
forecast_atm2 <- atm2_fit %>% forecast(h=31)
forecast_atm3 <- atm3_fit %>% forecast(h=31)
forecast_atm4 <- atm4_fit %>% forecast(h=31)
write.csv(forecast_atm1,"forecast_atm1.csv")
write.csv(forecast_atm2,"forecast_atm2.csv")
write.csv(forecast_atm3,"forecast_atm3.csv")
write.csv(forecast_atm4,"forecast_atm4.csv")
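Optionally, the four forecasts could also be combined into a single Excel-readable file; in this sketch the distribution column is dropped and .mean provides the point forecasts:
# Sketch: bind the four fables into one flat table for Excel.
all_fc <- bind_rows(
  as_tibble(forecast_atm1) %>% mutate(ATM = "ATM1"),
  as_tibble(forecast_atm2) %>% mutate(ATM = "ATM2"),
  as_tibble(forecast_atm3) %>% mutate(ATM = "ATM3"),
  as_tibble(forecast_atm4) %>% mutate(ATM = "ATM4")
) %>%
  select(ATM, DATE, Cash = .mean)
write.csv(all_fc, "forecast_all_atms.csv", row.names = FALSE)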
Moving on to Part B, we will again load the data, which I have saved in a local folder, and explore it.
data2 <- read_excel("ResidentialCustomerForecastLoad-624.xlsx", col_names = TRUE)
The data contain the variables “CaseSequence”, “YYYY-MMM” and “KWH”.
head(data2)
## # A tibble: 6 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
As we can observe in the table above, the variable “YYYY-MMM” is of data type “chr” (character). We will have to convert it to a date in order to work with the data.
summary(data2)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
We can also see that the maximum value of the “KWH” variable is 10,655,730 and the minimum is 770,523. We’ll have to explore the data some more to determine whether either of these values is an outlier.
plot_missing(data2)
There is one missing value for “KWH” that we will need to address.
We will create a separate data set in order to preserve the original data and make all the necessary transformations there. First we’ll convert the data into a time series, making the variable “YYYY-MMM” the index.
power <- data2 %>%
mutate (Month = yearmonth(`YYYY-MMM`)) %>%
select(-`YYYY-MMM`, -CaseSequence) %>%
as_tsibble(index = Month)
str(power)
## tbl_ts [192 x 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
## $ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
## $ Month: mth [1:192] 1998 Jan, 1998 Feb, 1998 Mar, 1998 Apr, 1998 May, 1998 Jun...
## - attr(*, "key")= tibble [1 x 1] (S3: tbl_df/tbl/data.frame)
## ..$ .rows: list<int> [1:1]
## .. ..$ : int [1:192] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..@ ptype: int(0)
## - attr(*, "index")= chr "Month"
## ..- attr(*, "ordered")= logi TRUE
## - attr(*, "index2")= chr "Month"
## - attr(*, "interval")= interval [1:1] 1M
## ..@ .regular: logi TRUE
autoplot(power) +
labs(title="Power Consumption", subtitle="Residential", y="KWH")
## Plot variable not specified, automatically selected `.vars = KWH`
In the graph above, we can clearly see there is an outlier, which could potentially be an error.
which(power$KWH == 770523)
## [1] 151
This value belongs to observation 151.
Let’s also locate the missing value noted previously:
which(is.na(power$KWH))
## [1] 129
There is one value missing from observation 129.
hist(power$KWH)
Since the distribution of “KWH” is close to normal, I will use the average to impute both of these values.
power[129,1] <- round(mean(power$KWH, na.rm=TRUE),0)
power[151,1] <- round(mean(power$KWH, na.rm=TRUE),0)
Additionally, we can look for a Box-Cox transformation for the series to help simplify it:
lambdap <- power%>%
features(KWH,features = guerrero)%>%
pull(lambda_guerrero)
lambdap
## [1] -0.2108159
The guerrero() feature suggests a Box-Cox transformation with lambda ≈ -0.21.
power %>%
autoplot(box_cox(KWH, lambdap)) +
labs(title="Power Consumption Transformed", subtitle="Residential", y="KWH")
The transformation helped scale down the series but did not do much to normalize it.
Let’s now look at the decomposition of the series to see whether we have strong seasonality and whether differencing may be required in the model. Since the magnitude of the seasonal component does not seem to change with time, we can say the series is additive.
power%>%
model(classical_decomposition(box_cox(KWH, lambdap), type="additive")) %>%
components () %>%
autoplot() +
labs(title="Classical additive decomposition of Power Consumption")
## Warning: Removed 6 row(s) containing missing values (geom_path).
As is evident in the plot above, we see a strong seasonal component and a slight trend in these data.
power %>%
ACF(box_cox(KWH, lambdap)) %>%
autoplot() + labs(title="Autocorrelation of Power Consumption")
We can observe in the ACF plot that there is a lot of autocorrelation. We will apply unitroot_nsdiffs() to the power consumption data to determine whether seasonal differencing is needed.
power %>%
features(box_cox(KWH, lambdap), unitroot_nsdiffs)
## # A tibble: 1 x 1
## nsdiffs
## <int>
## 1 1
As determined by the function above, we need to apply one order of seasonal differencing to the power consumption data. Let’s explore further to see if we need any additional differencing:
power %>%
features(difference(box_cox(KWH, lambdap), 12), unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 0
No additional differencing seems to be needed. Let’s take a look at the ACF plot after differencing the data.
power %>%
ACF(difference(box_cox(KWH, lambdap), 12)) %>%
autoplot() + labs(title="Autocorrelation of Power Consumption Differenced")
The data look closer to white noise after differencing.
Considering that these data need seasonal differencing, we will again use the ARIMA() model, which applies differencing within the algorithm, making it simpler to build and a straightforward choice.
power_fit <- power %>%
model(ARIMA(box_cox(KWH, lambdap)))
report(power_fit)
## Series: KWH
## Model: ARIMA(1,0,0)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, lambdap)
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.2552 -0.7405 -0.3798 9e-04
## s.e. 0.0749 0.0736 0.0752 3e-04
##
## sigma^2 estimated as 1.309e-05: log likelihood=764.92
## AIC=-1519.85 AICc=-1519.5 BIC=-1503.88
power_fit %>%
gg_tsresiduals()
The ARIMA() function picks an ARIMA(1,0,0)(2,1,0)[12] with drift. The residuals’ distribution looks close to normal, they have constant variability, and we can say they resemble white noise.
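As with the ATMs, a Ljung-Box test can supplement the visual residual check; in this sketch, lag = 24 covers two seasonal cycles and dof = 4 matches the four estimated coefficients:
augment(power_fit) %>%
  features(.innov, ljung_box, lag = 24, dof = 4)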
We will take a look at the forecasts for power consumption for the year 2014:
power_fit %>% forecast(h=12) %>%
autoplot(power)
The forecast seems reasonable from a visual perspective.
Finally, we save the forecast in a .csv document:
forecast_power <- power_fit %>% forecast(h=12)
write.csv(forecast_power,"forecast_power.csv")
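If a true .xlsx file is preferred, the writexl package (assuming it is installed) can export the same table; this sketch converts the yearmonth index to text so Excel displays it cleanly:
library(writexl)
forecast_power %>%
  as_tibble() %>%
  mutate(Month = as.character(Month)) %>%
  select(Month, KWH = .mean) %>%
  write_xlsx("forecast_power.xlsx")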