This project consists of 3 parts - two required and one bonus and is worth 15% of your grade.
The project is due at 11:59 PM on Sunday Apr 11.
I will accept late submissions with a penalty until the meetup after that when we review some projects.
Part A – ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010.
The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward.
I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast.
I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.3 ──
## ✓ broom 0.7.8 ✓ rsample 0.1.0
## ✓ dials 0.0.9 ✓ tune 0.1.5
## ✓ infer 0.5.4 ✓ workflows 0.2.2
## ✓ modeldata 0.1.1 ✓ workflowsets 0.0.2
## ✓ parsnip 0.1.6 ✓ yardstick 0.0.8
## ✓ recipes 0.1.16
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(readxl)
library(tsibble)
##
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
## The following object is masked from 'package:yardstick':
##
## accuracy
library(fabletools)
##
## Attaching package: 'fabletools'
## The following objects are masked from 'package:forecast':
##
## accuracy, forecast
## The following object is masked from 'package:yardstick':
##
## accuracy
## The following object is masked from 'package:parsnip':
##
## null_model
## The following object is masked from 'package:infer':
##
## generate
library(fable)
library(feasts)
# I first save the excel file to my desktop and then use the readxl package to read it into R
atm_df <- read_excel('~/Desktop/ATM624Data.xlsx')
# I inspect the class. It's not yet a tsibble
class(atm_df)
## [1] "tbl_df" "tbl" "data.frame"
# I change the date. I had expected the origin to be 1900-01-01 but it's two days off; I don't know why.
atm_df$DATE <- as.Date(atm_df$DATE, origin = "1899-12-30")
# Glimpse the data. Only three variables.
glimpse(atm_df)
## Rows: 1,474
## Columns: 3
## $ DATE <date> 2009-05-01, 2009-05-01, 2009-05-02, 2009-05-02, 2009-05-03, 2009…
## $ ATM <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
# need to separate the key and index, otherwise we get an error message for duplicate dates.
# A tsibble needs to only have one date
atm_df <- as_tsibble(atm_df, key = ATM, index = "DATE")
# let's try a first autopolot
autoplot(atm_df)
## Plot variable not specified, automatically selected `.vars = Cash`
## Warning: Removed 14 row(s) containing missing values (geom_path).
# there's a crazy huge spike in ATM4. That's interesting.
any(is.na(atm_df)) # This evaluates to TRUE - there are some NA values. This could be a problem.
## [1] TRUE
any(is.na(atm_df$DATE)) # There are no NA in the Date column, which is good. But there still might be a missing date, so we should check this. Although I probably couldn't have turned this into a tsibble if there were a missing date.
## [1] FALSE
any(is.na(atm_df$ATM)) # NA values exist in the ATM data section.
## [1] TRUE
# We actually see that the missing values exist in the most recent 15 days. So we can just remove the most recent 15 days. Probably, data hadn't been updated or recorded at the time this data was gathered.
atm_df <- atm_df %>%
filter(!is.na(ATM))
any(is.na(atm_df$Cash)) # NA values also exist in the Cash data section.
## [1] TRUE
# We are also missing Cash values for ATM1 and ATM2.
# because the distribution is kind of bi-modal (lots of zero values) I am choosing to impute with the median value
ggplot(atm_df, aes(x = Cash)) + geom_histogram() + facet_wrap(~ATM, scales = 'free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (stat_bin).
atm_df$Cash[(is.na(atm_df$Cash) & atm_df$ATM == "ATM1")] <- median(atm_df$Cash[atm_df$ATM == "ATM1"], na.rm = TRUE)
atm_df$Cash[(is.na(atm_df$Cash) & atm_df$ATM == "ATM2")] <- median(atm_df$Cash[atm_df$ATM == "ATM2"], na.rm = TRUE)
atm_df %>%
autoplot() +
facet_wrap(~ATM, scales = 'free') +
labs(title = "ATM", subtitle = "data") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Data 624 Project 1")
## Plot variable not specified, automatically selected `.vars = Cash`
# It looks like ATM1 and ATM2 have some trend and seasonality
# Not yet sure what is going on with ATM3 and ATM4
atm_2_df <- atm_df %>%
filter(ATM == "ATM2")
atm_df %>%
filter(ATM == "ATM1") %>%
autoplot() +
facet_wrap(~ATM, scales = 'free') +
labs(title = "ATM1 Cash", subtitle = "Data without Forecast") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Data 624 Project 1")
## Plot variable not specified, automatically selected `.vars = Cash`
fit_atm1 <- atm_df %>%
filter(ATM == "ATM1") %>%
model(
additive = ETS(Cash ~ error("A") + trend("A") +
season("A")),
multiplicative = ETS(Cash ~ error("M") + trend("A") +
season("M"))
)
fc <- fit_atm1 %>%
forecast(h = 30)
fit_atm1 %>% report()
## Warning in report.mdl_df(.): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 2 x 10
## ATM .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 additive 584. -2234. 4492. 4492. 4538. 567. 570. 15.1
## 2 ATM1 multiplicative 0.132 -2274. 4571. 4572. 4618. 701. 716. 0.217
# the report function tells us that alpha = and gamma =
fc %>%
autoplot(atm_df) +
geom_line(aes(y = .fitted), data = augment(fit_atm1)) +
labs(title = "ATM1 Cash", subtitle = "Data with Forecasts") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Data 624 Project 1")
atm_df1 <- atm_df %>%
filter(DATE >= "2010-02-01" & ATM == "ATM1")
fc %>%
autoplot(atm_df1) +
# geom_line(aes(y = .fitted), data = augment(fit_atm1)) +
labs(title = "ATM1 Cash", subtitle = "Data with Forecasts") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Data 624 Project 1")
# the additive and multiplicative models are best because they account for seasonality
############################################################
############################################################
atm_df %>%
filter(ATM == "ATM2") %>%
autoplot() +
facet_wrap(~ATM, scales = 'free') +
labs(title = "ATM2 Cash", subtitle = "Data without Forecast") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Data 624 Project 1")
## Plot variable not specified, automatically selected `.vars = Cash`
fit_atm2 <- atm_df %>%
filter(ATM == "ATM2") %>%
model(
additive = ETS(Cash ~ error("A") + trend("A") +
season("A")),
multiplicative = ETS(Cash ~ error("M") + trend("A") +
season("M"))
)
fc <- fit_atm2 %>%
forecast(h = 30)
fit_atm2 %>% report()
## Warning in report.mdl_df(.): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 2 x 10
## ATM .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 additive 651. -2253. 4531. 4532. 4577. 631. 633. 17.8
## 2 ATM2 multiplicative 0.375 -2379. 4781. 4782. 4828. 1211. 1211. 0.493
# the report function tells us that alpha = and gamma =
fc %>%
autoplot(atm_df) +
geom_line(aes(y = .fitted), data = augment(fit_atm1)) +
labs(title = "ATM2 Cash", subtitle = "Data with Forecasts") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Data 624 Project 1")
atm_df2 <- atm_df %>%
filter(DATE >= "2010-02-01" & ATM == "ATM2")
fc %>%
autoplot(atm_df2) +
# geom_line(aes(y = .fitted), data = augment(fit_atm1)) +
labs(title = "ATM2 Cash", subtitle = "Data with Forecasts") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Data 624 Project 1")
# For ATM2 the data is a little more different between the additive and the multiplicative models.
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple data set of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
## # A tibble: 192 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
## 7 739 1998-Jul 8914755
## 8 740 1998-Aug 8607428
## 9 741 1998-Sep 6989888
## 10 742 1998-Oct 6345620
## # … with 182 more rows
## [1] "tbl_df" "tbl" "data.frame"
## Rows: 192
## Columns: 3
## $ CaseSequence <dbl> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
## $ `YYYY-MMM` <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
## $ KWH <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…
## [1] TRUE
## Warning in report.mdl_df(.): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 2 x 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additive 7.48e+11 -3121. 6276. 6280. 6332. 6.86e11 7.00e11 5.21e+5
## 2 multiplica… 1.42e- 2 -3096. 6227. 6231. 6282. 6.96e11 7.23e11 7.55e-2
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.