Forecast how much cash is taken out of 4 different ATM machines for
May 2010. The data is given in a single file. The variable
Cash is provided in hundreds of dollars. Explain and
demonstrate your process, techniques used and not used, and your actual
forecast. Provide your written report on your findings, visuals,
discussion and your R code via an RPubs link along with the actual .rmd
file.
First, let’s load and take a look at the data.
atm_data <- read.csv("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_624/Projects/data/ATM624Data.csv", na.strings = c('', ' '))
# glimpse preliminary information
glimpse(atm_data)
## Rows: 1,474
## Columns: 3
## $ DATE <chr> "5/1/2009 12:00:00 AM", "5/1/2009 12:00:00 AM", "5/2/2009 12:00:0…
## $ ATM <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <int> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
The DATE column is incorrectly coded as a character and should be converted to a date. We will also cast ATM as a factor and turn the dataframe into a tsibble.
atm_data <- atm_data |>
mutate(DATE = mdy_hms(DATE),
DATE = as_date(DATE),
ATM = as.factor(ATM)) |>
as_tsibble(index=DATE, key=ATM)
# check values
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 ATM1:365 Min. : 0.0
## 1st Qu.:2009-08-01 ATM2:365 1st Qu.: 0.5
## Median :2009-11-01 ATM3:365 Median : 73.0
## Mean :2009-10-31 ATM4:365 Mean : 155.6
## 3rd Qu.:2010-02-01 NA's: 14 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10920.0
## NA's :19
The ATM information is missing from 14 entries and the
Cash information is missing from 19.
Let’s take a look at the entries with missing ATM
values.
atm_data |>
filter(is.na(ATM)) |>
knitr::kable()
| DATE | ATM | Cash |
|---|---|---|
| 2010-05-01 | NA | NA |
| 2010-05-02 | NA | NA |
| 2010-05-03 | NA | NA |
| 2010-05-04 | NA | NA |
| 2010-05-05 | NA | NA |
| 2010-05-06 | NA | NA |
| 2010-05-07 | NA | NA |
| 2010-05-08 | NA | NA |
| 2010-05-09 | NA | NA |
| 2010-05-10 | NA | NA |
| 2010-05-11 | NA | NA |
| 2010-05-12 | NA | NA |
| 2010-05-13 | NA | NA |
| 2010-05-14 | NA | NA |
The entries with missing ATM also have missing Cash, and they cover the first two weeks of May 2010, the very period we are asked to forecast. Since they carry no information, we can filter them out.
atm_data <- atm_data |>
filter(!is.na(ATM))
Let’s take a look at the remaining missing Cash
entries.
atm_data |>
filter(is.na(Cash)) |>
knitr::kable()
| DATE | ATM | Cash |
|---|---|---|
| 2009-06-13 | ATM1 | NA |
| 2009-06-16 | ATM1 | NA |
| 2009-06-22 | ATM1 | NA |
| 2009-06-18 | ATM2 | NA |
| 2009-06-24 | ATM2 | NA |
There are three missing values for ATM1 and two missing values for ATM2.
Let’s take a look at the trends for these ATMs as they stand now.
atm_data |>
autoplot(Cash) +
labs(title = "ATM Withdrawal for Four ATMs (May 2009 - April 2010)", x = "Date")
ATM4 almost always has the highest withdrawal amounts. There is also a large outlier for this ATM sometime between February and March of 2010. Let’s plot each ATM on its own scale so that we can see the trend for each.
atm_data |>
autoplot(Cash) +
facet_wrap(~ATM, ncol=1, scales = "free_y") +
labs(title = "ATM Withdrawal for Four ATMs (May 2009 - April 2010)", x = "Date")
ATM4 has a large outlier. For ATM3, all values are 0 until the end of April 2010. Possible explanations are that this ATM was out of order until that point, or that it only began operating then. Another possibility is that there were withdrawals but, because amounts are recorded in hundreds, rounding pushed the daily totals to 0 (e.g., a daily total of $40 would round to 0 hundreds); this explanation is probably unlikely. Whatever the reason, this ATM's history will not yield meaningful forecasts, as the results would be heavily skewed by the $0 withdrawal amounts.
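To check this, here is a quick sketch (using the cleaned atm_data from above) that lists the days on which ATM3 recorded a nonzero withdrawal:
# list the days on which ATM3 recorded a nonzero withdrawal
atm_data |>
  filter(ATM == "ATM3", Cash > 0) |>
  knitr::kable()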
Let’s take a look at the distributions for the other three ATMs.
plot1 <- atm_data |>
filter(ATM != "ATM3") |>
ggplot(aes(x=Cash)) +
geom_histogram(aes(fill=ATM), bins=30) +
facet_wrap(~ATM, nrow=1, scales = "free_x") +
theme(legend.position="none") +
labs(title = "Histograms of ATM Cash Withdrawals", y = "Count", x = "Cash (Hundreds of $)")
plot2 <- atm_data |>
filter(ATM != "ATM3") |>
ggplot(aes(x=Cash)) +
geom_boxplot(aes(fill=ATM)) +
facet_wrap(~ATM, nrow=1, scales = "free_x") +
theme(legend.position="none") +
labs(title = "Boxplots of ATM Cash Withdrawals", y = "", x = "Cash (Hundreds of $)")
plot_grid(plot1, plot2, ncol=1)
Here we can see the extreme outlier in ATM4 that we saw in the line plot. We can also see some outliers for ATM1.
ATM1 is missing 3 Cash values (about 0.8% of the data). Its distribution appears bimodal, with more observations in the right peak, and there are some outliers in both tails. Since the median is robust to these outliers, as can be seen from the boxplot, we will use median imputation for this ATM.
ATM2 is missing 2 values (about 0.5% of the data). Its distribution also appears bimodal, with roughly equal observations in each peak, so we will use mode imputation for this ATM.
We will use median imputation to replace the large outlier in ATM4.
# statistical mode (most frequent value); note this masks base::mode()
mode <- function(x) {
uniq_x <- unique(x)
uniq_x[which.max(tabulate(match(x, uniq_x)))]
}
atm_data_imputed <- atm_data |>
mutate(Cash = ifelse(Cash == max(atm_data$Cash, na.rm = T), NA, Cash)) |>
group_by(ATM) |>
mutate(Cash = case_when(
is.na(Cash) & ATM %in% c("ATM1", "ATM4") ~ median(Cash, na.rm=T),
is.na(Cash) & ATM == "ATM2" ~ mode(Cash),
TRUE ~ Cash
  )) |>
  ungroup()
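As a quick sanity check, no missing Cash values should remain after the imputation:
# verify that imputation left no NAs in Cash
sum(is.na(atm_data_imputed$Cash))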
Let’s take a look at the data now that the values have been imputed.
atm_data_imputed |>
autoplot(Cash) +
facet_wrap(~ATM, ncol=1, scales = "free_y") +
labs(title = "ATM Withdrawal for Four ATMs (May 2009 - April 2010)", x = "Date")
atm_data_imputed |>
ACF(Cash) |>
autoplot() +
facet_wrap(~ATM, ncol=1, scales = "free_y") +
  labs(title = "ACF Plots", x = "Lag")
From a preliminary look, it is hard to tell whether the series are stationary. ATM1 and ATM2 show weekly seasonality, apparent from the significant spikes at lags 7, 14, and 21 in their ACF plots, so these series may not be stationary. We can use a KPSS unit root test to check.
atm_data_imputed |>
features(Cash, unitroot_kpss) |>
knitr::kable()
| ATM | kpss_stat | kpss_pvalue |
|---|---|---|
| ATM1 | 0.4651358 | 0.0495190 |
| ATM2 | 1.9660900 | 0.0100000 |
| ATM3 | 0.3891606 | 0.0818273 |
| ATM4 | 0.1181178 | 0.1000000 |
At the 5% significance level, we reject the null hypothesis of stationarity for ATM1 (just barely, p = 0.0495) and ATM2 (p = 0.01), while ATM3 and ATM4 appear stationary.
We can see how many orders of differencing must be done to make the data stationary.
atm_data_imputed |>
features(Cash, unitroot_ndiffs)
## # A tibble: 4 × 2
## ATM ndiffs
## <fct> <int>
## 1 ATM1 1
## 2 ATM2 1
## 3 ATM3 0
## 4 ATM4 0
The test suggests one order of differencing for both ATM1 and ATM2. Given the weekly seasonality seen in the ACF plots, we will apply a seasonal difference (lag 7) and check whether any further differencing is needed.
atm_data_imputed |>
features(difference(Cash,7), unitroot_ndiffs)
## # A tibble: 4 × 2
## ATM ndiffs
## <fct> <int>
## 1 ATM1 0
## 2 ATM2 0
## 3 ATM3 0
## 4 ATM4 0
After the seasonal difference, no further first-order differencing is needed.
atm_data_imputed |>
filter(ATM == "ATM1") |>
gg_tsdisplay(difference(Cash, lag=7),
plot_type="partial") +
labs(title = "ATM1: Seasonally Differenced")
We can see the seasonal lags in the PACF plot. There is one outstanding seasonal spike in the ACF plot at lag 7, suggesting a seasonal MA(1) component. The spike at lag 1 in the ACF plot could also indicate a non-seasonal MA(1).
atm_data_imputed |>
filter(ATM == "ATM2") |>
gg_tsdisplay(difference(Cash, lag=7),
plot_type="partial") +
labs(title = "ATM2: Seasonally Differenced")
Once again, we can see the seasonal spikes in the PACF plot. There is a significant spike in the ACF plot at lag 7, indicating a seasonal MA(1) component. There may also be a significant spike at a later seasonal lag (around lag 21), so we will also try a seasonal MA(2).
atm_data_imputed |>
filter(ATM == "ATM4") |>
gg_tsdisplay(Cash,
plot_type="partial") +
labs(title = "ATM4")
We can see the seasonal spikes decaying in the ACF plot. There is an outstanding seasonal spike at lag 7 in the PACF plot, indicating a seasonal AR(1) component.
We can use seasonal ARIMA models to forecast future withdrawals for each ATM. For each one we will compare an automatically selected ARIMA model against the candidates suggested by the ACF and PACF plots, and later hold out April 2010 as a validation set.
atm1 <- atm_data_imputed |>
filter(ATM == "ATM1")
atm1_mod <- atm1 |>
model(auto = ARIMA(Cash),
arima001011 = ARIMA(Cash ~ 0 + pdq(0,0,1) + PDQ(0,1,1, period=7)))
atm1_mod |>
knitr::kable()
| ATM | auto | arima001011 |
|---|---|---|
| ATM1 | <ARIMA(0,0,1)(0,1,2)[7]> | <ARIMA(0,0,1)(0,1,1)[7]> |
glance(atm1_mod) |>
select(.model:BIC) |>
knitr::kable()
| .model | sigma2 | log_lik | AIC | AICc | BIC |
|---|---|---|---|---|---|
| auto | 559.7744 | -1641.115 | 3290.231 | 3290.344 | 3305.753 |
| arima001011 | 564.7218 | -1643.061 | 3292.121 | 3292.189 | 3303.763 |
atm2 <- atm_data_imputed |>
filter(ATM == "ATM2")
atm2_mod <- atm2 |>
model(auto = ARIMA(Cash),
arima000011 = ARIMA(Cash ~ 0 + pdq(0,0,0) + PDQ(0,1,1, period=7)),
arima000012 = ARIMA(Cash ~ 0 + pdq(0,0,0) + PDQ(0,1,2, period=7)))
atm2_mod |>
knitr::kable()
| ATM | auto | arima000011 | arima000012 |
|---|---|---|---|
| ATM2 | <ARIMA(2,0,2)(0,1,1)[7]> | <ARIMA(0,0,0)(0,1,1)[7]> | <ARIMA(0,0,0)(0,1,2)[7]> |
glance(atm2_mod) |>
select(.model:BIC) |>
knitr::kable()
| .model | sigma2 | log_lik | AIC | AICc | BIC |
|---|---|---|---|---|---|
| auto | 588.0783 | -1649.266 | 3310.531 | 3310.771 | 3333.814 |
| arima000011 | 620.7961 | -1660.382 | 3324.764 | 3324.798 | 3332.525 |
| arima000012 | 622.0567 | -1660.237 | 3326.473 | 3326.541 | 3338.115 |
atm4 <- atm_data_imputed |>
filter(ATM == "ATM4")
atm4_mod <- atm4 |>
model(auto = ARIMA(Cash),
arima000100 = ARIMA(Cash ~ 0 + pdq(0,0,0) + PDQ(1,0,0, period=7)))
atm4_mod |>
knitr::kable()
| ATM | auto | arima000100 |
|---|---|---|
| ATM4 | <ARIMA(3,0,2)(1,0,0)[7] w/ mean> | <ARIMA(0,0,0)(1,0,0)[7]> |
glance(atm4_mod) |>
select(.model:BIC) |>
knitr::kable()
| .model | sigma2 | log_lik | AIC | AICc | BIC |
|---|---|---|---|---|---|
| auto | 117510.3 | -2645.177 | 5306.353 | 5306.758 | 5337.553 |
| arima000100 | 171786.4 | -2719.433 | 5442.865 | 5442.898 | 5450.665 |
In each case, the automatically selected ARIMA model has a lower AIC than the manually specified ones. For ATM1, the manual model's AIC is only slightly higher than the automatic model's, and its BIC is actually lower.
Let’s forecast using the automatically generated models.
atm_train <- atm_data_imputed |>
filter(DATE < "2010-04-01")
atms_mod <- atm_train |>
filter(ATM != "ATM3") |>
model(ARIMA(Cash))
fc <- atms_mod |>
forecast(h=61)
There is not enough history for ATM3 to fit a meaningful model, as only three days of nonzero withdrawals were recorded. For this ATM we will use the NAIVE method, which simply projects the last observed value forward.
atm3 <- atm_data_imputed |>
filter(ATM == "ATM3")
atm3_mod <- atm3 |>
model(NAIVE(Cash))
atm3_fc <- atm3_mod |>
forecast(h=31)
fc <- fc |>
bind_rows(atm3_fc)
fc |>
autoplot(atm_data_imputed)
The models for ATM1 and ATM2 appear to track future withdrawals reasonably well. The ATM4 model includes a mean term and its forecasts are less accurate.
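Since April 2010 was held out, we can also score the point forecasts numerically. This is a sketch using fabletools' accuracy(); the choice of metrics is ours:
# compare the April 2010 forecasts against the held-out actuals
fc |>
  filter(DATE < "2010-05-01") |>
  accuracy(atm_data_imputed) |>
  select(ATM, .model, RMSE, MAE, MAPE) |>
  knitr::kable()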
atms_mod |>
filter(ATM == "ATM1") |>
gg_tsresiduals() +
labs(title = "ATM1")
atms_mod |>
filter(ATM == "ATM2") |>
gg_tsresiduals() +
labs(title = "ATM2")
atms_mod |>
filter(ATM == "ATM4") |>
gg_tsresiduals() +
labs(title = "ATM4")
The residuals of the ATM1 and ATM2 models closely resemble white noise.
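We can back this up with a Ljung-Box test on the innovation residuals. This is a sketch; the choice of lag = 14 (two seasonal periods) is ours:
# large p-values are consistent with white noise residuals
atms_mod |>
  augment() |>
  features(.innov, ljung_box, lag = 14) |>
  knitr::kable()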
may_forecasts <- fc |>
filter(DATE >= "2010-05-01")
wb = createWorkbook()
sh = addWorksheet(wb, "ATM Forecasts")
xl_write(may_forecasts, wb, sh)
saveWorkbook(wb, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_624/Projects/output/project1_forecasts.xlsx", overwrite = TRUE)
Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Let’s load the data.
power_data <- read.csv('https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_624/Projects/data/ResidentialCustomerForecastLoad-624.csv')
# glimpse preliminary information
glimpse(power_data)
## Rows: 192
## Columns: 3
## $ CaseSequence <int> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
## $ YYYY.MMM <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
## $ KWH <int> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…
The data consists of 3 columns.
CaseSequence is an ID column and is unnecessary for the purposes of this analysis, so it can be removed.
The date column (YYYY.MMM) is coded as a character. In order to perform a time series analysis on this data, we will need to parse it as a date.
We will also turn this table into a tsibble.
power_data_ts <- power_data |>
rename(Date = YYYY.MMM) |>
mutate(Date = yearmonth(Date)) |>
select(-CaseSequence) |>
as_tsibble(index=Date)
Let’s check if there are any missing values.
colSums(is.na(power_data_ts))
## Date KWH
## 0 1
There is one missing value for KWH. Let’s see which date
is missing this recorded value.
power_data_ts |>
filter(is.na(KWH)) |>
knitr::kable()
| Date | KWH |
|---|---|
| 2008 Sep | NA |
September 2008 is missing the recorded KWH.
Let’s take a look at the distribution of KWH and the
trend of the data to address this value.
plot1 <- power_data_ts |>
autoplot(KWH) +
labs(title = "Lineplot of Power Usage", x = "Date") +
scale_y_continuous(labels=scales::comma)
plot2 <- power_data_ts |>
ggplot(aes(x = KWH)) +
geom_histogram(bins=15, fill='lightblue') +
scale_x_continuous(labels=scales::comma) +
labs(title = "Histogram of KWH")
plot3 <- power_data_ts |>
ggplot(aes(x = KWH)) +
geom_boxplot(fill='lightblue') +
scale_x_continuous(labels=scales::comma) +
labs(title = "Boxplot of KWH")
plot_grid(plot1, plot_grid(plot2, plot3, nrow=1), ncol=1)
The line plot shows seasonality in the data. It also shows a major outlier sometime in 2010, which is visible in the histogram and boxplot as well. We will need to deal with the missing value as well as this outlier. What is the outlier?
power_data_ts |>
filter(KWH < 3000000)
## # A tsibble: 1 x 2 [1M]
## Date KWH
## <mth> <int>
## 1 2010 Jul 770523
July 2010 is the outlier, with only 770,523 KWH used that month; every other month is in the millions. This could be a data-entry error, perhaps with a missing digit.
# remove outlier
outlier_ind <- which(power_data_ts$Date == yearmonth("2010 Jul"))
power_data_ts$KWH[outlier_ind] <- NA
Let’s take a look at the lags for this variable to see how we should impute the missing values.
power_data_ts |>
gg_lag(KWH, geom="point", lags = 1:12)
Based on the lag plots, observations are most similar at lag 12, i.e., at yearly intervals.
power_data_ts |>
gg_season(KWH) +
scale_y_continuous(labels=scales::comma) +
  labs(title = "Seasonal Plot of KWH", x = "Month")
The data follows a clear seasonal pattern: power usage is higher at the beginning, middle, and end of the year.
To impute the two missing values, we will average the power usage for the same month in the two prior years.
impute_jul <- power_data |>
filter(YYYY.MMM %in% c("2008-Jul", "2009-Jul"))
impute_sep <- power_data |>
filter(YYYY.MMM %in% c("2006-Sep", "2007-Sep"))
# impute
power_data_ts <- power_data_ts |>
mutate(KWH = ifelse(Date == yearmonth("2010 Jul"), mean(impute_jul$KWH), KWH),
KWH = ifelse(Date == yearmonth("2008 Sep"), mean(impute_sep$KWH), KWH))
power_data_ts |>
autoplot(KWH) +
labs(title = "Lineplot of Power Usage", x = "Date") +
scale_y_continuous(labels=scales::comma)
Due to the clear seasonality in the data, we will need to use seasonal differencing to make this data stationary.
power_data_ts |>
gg_tsdisplay(difference(KWH, lag=12),
plot_type="partial") +
labs(title = "Power Usage: Seasonally Differenced")
Let’s see whether a further first-order difference is required after the seasonal differencing.
power_data_ts |>
features(difference(KWH, lag=12), unitroot_kpss) |>
knitr::kable()
| kpss_stat | kpss_pvalue |
|---|---|
| 0.1623246 | 0.1 |
No further differencing is required.
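As a cross-check, unitroot_nsdiffs() should confirm that a single seasonal difference suffices for the raw series (a quick sketch):
# number of seasonal differences suggested for KWH
power_data_ts |>
  features(KWH, unitroot_nsdiffs) |>
  knitr::kable()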
The ACF and PACF plots of the seasonally differenced data are quite similar. Both show an outstanding spike at lag 12, so we can try either a seasonal AR(1) or a seasonal MA(1) component. There is also a possible outstanding non-seasonal spike at lag 3 in both plots, so we will try adding a non-seasonal MA(1) term as well.
Finally, there is possibly a slight upward trend in the data, so we will also test models with drift.
We will use 2012 and 2013 as validation sets.
power_mods <- power_data_ts |>
filter(year(Date) < 2012) |>
model(auto = ARIMA(KWH),
arima000011 = ARIMA(KWH ~ 0 + pdq(0,0,0) + PDQ(0,1,1, period=12)),
arima000110 = ARIMA(KWH ~ 0 + pdq(0,0,0) + PDQ(1,1,0, period=12)),
arima001011 = ARIMA(KWH ~ 0 + pdq(0,0,1) + PDQ(0,1,1, period=12)),
arima001110 = ARIMA(KWH ~ 0 + pdq(0,0,1) + PDQ(1,1,0, period=12)),
arima001011drift = ARIMA(KWH ~ 1 + pdq(0,0,1) + PDQ(0,1,1, period=12)),
arima001110drift = ARIMA(KWH ~ 1 + pdq(0,0,1) + PDQ(1,1,0, period=12)))
power_mods |>
knitr::kable()
| auto | arima000011 | arima000110 | arima001011 | arima001110 | arima001011drift | arima001110drift |
|---|---|---|---|---|---|---|
| <ARIMA(0,0,1)(2,1,0)[12] w/ drift> | <ARIMA(0,0,0)(0,1,1)[12]> | <ARIMA(0,0,0)(1,1,0)[12]> | <ARIMA(0,0,1)(0,1,1)[12]> | <ARIMA(0,0,1)(1,1,0)[12]> | <ARIMA(0,0,1)(0,1,1)[12] w/ drift> | <ARIMA(0,0,1)(1,1,0)[12] w/ drift> |
glance(power_mods) |>
select(.model:BIC) |>
knitr::kable()
| .model | sigma2 | log_lik | AIC | AICc | BIC |
|---|---|---|---|---|---|
| auto | 358748449455 | -2298.737 | 4607.474 | 4607.874 | 4622.723 |
| arima000011 | 473934089563 | -2320.148 | 4644.297 | 4644.375 | 4650.396 |
| arima000110 | 491517415472 | -2322.441 | 4648.882 | 4648.961 | 4654.982 |
| arima001011 | 385874805250 | -2304.657 | 4615.313 | 4615.471 | 4624.463 |
| arima001110 | 429148017595 | -2311.241 | 4628.482 | 4628.640 | 4637.632 |
| arima001011drift | 356469559457 | -2299.458 | 4606.915 | 4607.180 | 4619.115 |
| arima001110drift | 418376486644 | -2308.873 | 4625.747 | 4626.011 | 4637.946 |
The ARIMA(0,0,1)(0,1,1)[12] model with drift has the lowest AIC and BIC.
power_mods |>
forecast(h="3 years") |>
autoplot(power_data_ts |> filter(year(Date) > 2010), level= NULL) +
labs(title = "KWH Forecast Models") +
scale_y_continuous(labels=scales::comma)
Each model follows the data fairly closely, although the two ARIMA(0,0,1)(1,1,0)[12] variants (with and without drift) seem to overestimate in mid-2013.
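The plot is suggestive, but we can also score each model over the 2012-2013 hold-out. This is a sketch using accuracy(); the choice of metrics is ours:
# out-of-sample accuracy over the two validation years
power_mods |>
  forecast(h = "2 years") |>
  accuracy(power_data_ts) |>
  select(.model, RMSE, MAE, MAPE) |>
  arrange(RMSE) |>
  knitr::kable()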
Let’s compare the residuals of the automatically generated model with the ARIMA(0,0,1)(0,1,1)[12] model with drift.
power_mods |>
select(auto) |>
gg_tsresiduals() +
labs(title = "Auto ARIMA")
power_mods |>
select(arima001011drift) |>
gg_tsresiduals() +
labs(title = "ARIMA(0,0,1)(0,1,1)[12] with drift")
The residuals of the ARIMA(0,0,1)(0,1,1)[12] model with drift resemble white noise slightly better than those of the automatically selected model. One lag still appears autocorrelated, but it is much smaller than the autocorrelated lag in the automatic model. The distribution of the residuals looks roughly normal and there is no obvious heteroskedasticity.
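A Ljung-Box test on the chosen model's residuals can confirm this. Again a sketch; the choice of lag = 24 (two seasonal periods) is ours:
# a large p-value indicates residuals indistinguishable from white noise
power_mods |>
  select(arima001011drift) |>
  augment() |>
  features(.innov, ljung_box, lag = 24) |>
  knitr::kable()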
We will forecast for 2014 using the ARIMA(0,0,1)(0,1,1)[12] model with drift.
fc_2014 <- power_mods |>
select(arima001011drift) |>
forecast(h="3 years") |>
filter(year(Date) == 2014)
sh = addWorksheet(wb, "Power Forecasts")
xl_write(fc_2014, wb, sh, append=T)
saveWorkbook(wb, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_624/Projects/output/project1_forecasts.xlsx", overwrite = TRUE)
Part C consists of two data sets. These are simple two-column sets, but they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (for multiple recordings within an hour, take the mean). Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-ahead forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
waterpipe1_data <- read.csv('https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_624/Projects/data/Waterflow_Pipe1.csv')
waterpipe2_data <- read.csv('https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_624/Projects/data/Waterflow_Pipe2.csv')
glimpse(waterpipe1_data)
## Rows: 1,000
## Columns: 2
## $ Date.Time <chr> "10/23/15 12:24 AM", "10/23/15 12:40 AM", "10/23/15 12:53 AM…
## $ WaterFlow <dbl> 23.369599, 28.002881, 23.065895, 29.972809, 5.997953, 15.935…
glimpse(waterpipe2_data)
## Rows: 1,000
## Columns: 2
## $ Date.Time <chr> "10/23/15 1:00 AM", "10/23/15 2:00 AM", "10/23/15 3:00 AM", …
## $ WaterFlow <dbl> 18.810791, 43.087025, 37.987705, 36.120379, 31.851259, 28.23…
First, we need to parse the Date.Time columns into proper datetimes.
waterpipe1_data <- waterpipe1_data |>
mutate(Date.Time = as.POSIXct(Date.Time, format = "%m/%d/%y %I:%M %p"))
waterpipe2_data <- waterpipe2_data |>
mutate(Date.Time = as.POSIXct(Date.Time, format = "%m/%d/%y %I:%M %p"))
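A quick sanity check, since a timestamp that fails to match the format string would silently become NA:
# count timestamps that failed to parse
sum(is.na(waterpipe1_data$Date.Time))
sum(is.na(waterpipe2_data$Date.Time))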
We can now round down to the hour and take an average for each hour.
hourly_pipe1 <- waterpipe1_data |>
mutate(Date.Time = floor_date(Date.Time, "hour")) |>
group_by(Date.Time) |>
summarize(avg_flow = mean(WaterFlow, na.rm=T)) |>
as_tsibble(index=Date.Time)
hourly_pipe2 <- waterpipe2_data |>
mutate(Date.Time = floor_date(Date.Time, "hour")) |>
group_by(Date.Time) |>
summarize(avg_flow = mean(WaterFlow, na.rm=T)) |>
as_tsibble(index=Date.Time)
Let’s look for any trends in the data.
There are some gaps in the data. We will use fill_gaps() to create explicit NA entries for any missing hours and then fill those NAs with the last observation carried forward.
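tsibble's count_gaps() can show how many hours are missing from each series before we fill them (a quick sketch):
# report the runs of missing hours in each hourly series
hourly_pipe1 |> count_gaps()
hourly_pipe2 |> count_gaps()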
plot1 <- hourly_pipe1 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
autoplot(avg_flow) +
labs(title = "Pipe 1: Hourly Water Flow", x = "Date", y = "Average Water Flow")
plot2 <- hourly_pipe1 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
ACF(avg_flow) |>
autoplot() +
labs(title = "ACF")
plot_grid(plot1, plot2, ncol=1)
Pipe1 appears to be stationary: there are no outstanding spikes in the ACF plot and no clear seasonal pattern.
plot1 <- hourly_pipe2 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
autoplot(avg_flow) +
labs(title = "Pipe 2: Hourly Water Flow", x = "Date", y = "Average Water Flow")
plot2 <- hourly_pipe2 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
ACF(avg_flow) |>
autoplot() +
labs(title = "ACF")
plot_grid(plot1, plot2, ncol=1)
Pipe2 also looks stationary, though there are three marginally significant spikes in the ACF plot.
Let’s confirm that both series are stationary using a KPSS unit root test.
hourly_pipe1 |>
features(avg_flow, unitroot_kpss) |>
knitr::kable()
| kpss_stat | kpss_pvalue |
|---|---|
| 0.2336022 | 0.1 |
hourly_pipe2 |>
  features(avg_flow, unitroot_kpss) |>
  knitr::kable()
Both pipe1 and pipe2 are clearly stationary.
We will use automatically generated ARIMA models to forecast.
pipe1_mod <- hourly_pipe1 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
model(ARIMA(avg_flow))
report(pipe1_mod)
## Series: avg_flow
## Model: ARIMA(0,0,0) w/ mean
##
## Coefficients:
## constant
## 19.8504
## s.e. 0.2735
##
## sigma^2 estimated as 18.1: log likelihood=-690.42
## AIC=1384.85 AICc=1384.9 BIC=1391.82
pipe2_mod <- hourly_pipe2 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
model(ARIMA(avg_flow))
report(pipe2_mod)
## Series: avg_flow
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean
##
## Coefficients:
## sma1 constant
## 0.0743 39.5820
## s.e. 0.0316 0.5422
##
## sigma^2 estimated as 256.3: log likelihood=-4195.3
## AIC=8396.6 AICc=8396.63 BIC=8411.33
For pipe1, the model selected is equivalent to a MEAN model (white noise around a constant). For pipe2, there is a seasonal MA(1) component in addition to the mean.
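As a sanity check on the pipe1 result, fitting fable's explicit MEAN() model on the same gap-filled series should give essentially the same fit (a sketch):
# an explicit mean model for pipe1, for comparison with the auto ARIMA
hourly_pipe1 |>
  fill_gaps() |>
  fill(avg_flow, .direction = "down") |>
  model(MEAN(avg_flow)) |>
  report()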
Let’s forecast for the next week using these models.
pipe1_fc <- pipe1_mod |>
forecast(h="1 week")
pipe1_fc |>
autoplot(hourly_pipe1) +
labs(title = "Pipe 1 Week Forecast", y = "Average Water Flow", x = "Date")
pipe2_fc <- pipe2_mod |>
forecast(h="1 week")
pipe2_fc |>
autoplot(hourly_pipe2) +
labs(title = "Pipe 2 Week Forecast", y = "Average Water Flow", x = "Date")
The pipe1 forecast is simply the series mean. The pipe2 forecast starts with a small seasonal fluctuation over the first 24 hours (from the seasonal MA(1) term) and then settles at the mean.
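Before saving the forecasts, a quick Ljung-Box check on both models' residuals (a sketch; the choice of lag = 24 is ours):
# large p-values are consistent with adequately specified models
bind_rows(
  augment(pipe1_mod) |> features(.innov, ljung_box, lag = 24),
  augment(pipe2_mod) |> features(.innov, ljung_box, lag = 24)
)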
sh = addWorksheet(wb, "Pipe1 Forecasts")
xl_write(pipe1_fc, wb, sh, append=T)
saveWorkbook(wb, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_624/Projects/output/project1_forecasts.xlsx", overwrite = TRUE)
sh = addWorksheet(wb, "Pipe2 Forecasts")
xl_write(pipe2_fc, wb, sh, append=T)
saveWorkbook(wb, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_624/Projects/output/project1_forecasts.xlsx", overwrite = TRUE)