Forecast how much cash is taken out of 4 different ATM machines for
May 2010. The data is given in a single file. The variable
Cash is provided in hundreds of dollars. Explain and
demonstrate your process, techniques used and not used, and your actual
forecast. Provide your written report on your findings, visuals,
discussion and your R code via an RPubs link along with the actual .rmd
file.
First, let’s load and take a look at the data.
atm_data <- read.csv("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_624/Projects/data/ATM624Data.csv", na.strings = c('', ' '))
# glimpse preliminary information
glimpse(atm_data)
## Rows: 1,474
## Columns: 3
## $ DATE <chr> "5/1/2009 12:00:00 AM", "5/1/2009 12:00:00 AM", "5/2/2009 12:00:0…
## $ ATM <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <int> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
The DATE column is incorrectly coded as a character and should be converted to a date. We will also cast ATM as a factor and turn the dataframe into a tsibble.
atm_data <- atm_data |>
mutate(DATE = mdy_hms(DATE),
DATE = as_date(DATE),
ATM = as.factor(ATM)) |>
as_tsibble(index=DATE, key=ATM)
# check values
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 ATM1:365 Min. : 0.0
## 1st Qu.:2009-08-01 ATM2:365 1st Qu.: 0.5
## Median :2009-11-01 ATM3:365 Median : 73.0
## Mean :2009-10-31 ATM4:365 Mean : 155.6
## 3rd Qu.:2010-02-01 NA's: 14 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10920.0
## NA's :19
The ATM information is missing from 14 entries and the
Cash information is missing from 19.
Let’s take a look at the entries with missing ATM
values.
atm_data |>
filter(is.na(ATM)) |>
knitr::kable()
| DATE | ATM | Cash |
|---|---|---|
| 2010-05-01 | NA | NA |
| 2010-05-02 | NA | NA |
| 2010-05-03 | NA | NA |
| 2010-05-04 | NA | NA |
| 2010-05-05 | NA | NA |
| 2010-05-06 | NA | NA |
| 2010-05-07 | NA | NA |
| 2010-05-08 | NA | NA |
| 2010-05-09 | NA | NA |
| 2010-05-10 | NA | NA |
| 2010-05-11 | NA | NA |
| 2010-05-12 | NA | NA |
| 2010-05-13 | NA | NA |
| 2010-05-14 | NA | NA |
The entries with missing ATM also have missing Cash, and they cover the first two weeks of May 2010, the very period we are asked to forecast. Since they carry no information, we can filter them out.
atm_data <- atm_data |>
filter(!is.na(ATM))
Let’s take a look at the remaining missing Cash
entries.
atm_data |>
filter(is.na(Cash)) |>
knitr::kable()
| DATE | ATM | Cash |
|---|---|---|
| 2009-06-13 | ATM1 | NA |
| 2009-06-16 | ATM1 | NA |
| 2009-06-22 | ATM1 | NA |
| 2009-06-18 | ATM2 | NA |
| 2009-06-24 | ATM2 | NA |
There are three missing values for ATM1 and two missing values for ATM2.
Let’s take a look at the trends for these ATMs as they stand now.
atm_data |>
autoplot(Cash) +
labs(title = "ATM Withdrawal for Four ATMs (May 2009 - April 2010)", x = "Date")
ATM4 almost always has the highest withdrawal amounts. There is also a large outlier for this ATM sometime between February and March of 2010. Let’s plot each ATM on its own scale so that we can see the trend for each.
atm_data |>
autoplot(Cash) +
facet_wrap(~ATM, ncol=1, scales = "free_y") +
labs(title = "ATM Withdrawal for Four ATMs (May 2009 - April 2010)", x = "Date")
ATM4 has a large outlier. For ATM3, all values are 0 until the end of April 2010. Possible explanations are that this ATM was out of order until that point, or that it only began operating then. Another possibility is that there were withdrawals but, because amounts are recorded in hundreds, rounding pushed the daily totals to 0 (e.g., a daily total of $40 would round to 0 hundreds); this explanation is probably unlikely. Whatever the reason, this ATM's history will not yield meaningful forecasts, as the results would be heavily skewed by the $0 withdrawal amounts.
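To check this, here is a quick sketch (using the cleaned atm_data from above) that lists the days on which ATM3 recorded a nonzero withdrawal:
# list the days on which ATM3 recorded a nonzero withdrawal
atm_data |>
  filter(ATM == "ATM3", Cash > 0) |>
  knitr::kable()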
Let’s take a look at the distributions for the other three ATMs.
plot1 <- atm_data |>
filter(ATM != "ATM3") |>
ggplot(aes(x=Cash)) +
geom_histogram(aes(fill=ATM), bins=30) +
facet_wrap(~ATM, nrow=1, scales = "free_x") +
theme(legend.position="none") +
labs(title = "Histograms of ATM Cash Withdrawals", y = "Count", x = "Cash (Hundreds of $)")
plot2 <- atm_data |>
filter(ATM != "ATM3") |>
ggplot(aes(x=Cash)) +
geom_boxplot(aes(fill=ATM)) +
facet_wrap(~ATM, nrow=1, scales = "free_x") +
theme(legend.position="none") +
labs(title = "Boxplots of ATM Cash Withdrawals", y = "", x = "Cash (Hundreds of $)")
plot_grid(plot1, plot2, ncol=1)
Here we can see the extreme outlier in ATM4 that we saw in the line plot. We can also see some outliers for ATM1.
ATM1 is missing 3 Cash values (about 0.8% of the data). Its distribution appears bimodal, with more observations in the right peak, and there are some outliers in both tails. Since the median is robust to these outliers, as can be seen from the boxplot, we will use median imputation for this ATM.
ATM2 is missing 2 values (about 0.5% of the data). Its distribution also appears bimodal, with roughly equal observations in each peak, so we will use mode imputation for this ATM.
We will use median imputation to replace the large outlier in ATM4.
# statistical mode (most frequent value); note this masks base::mode()
mode <- function(x) {
uniq_x <- unique(x)
uniq_x[which.max(tabulate(match(x, uniq_x)))]
}
atm_data_imputed <- atm_data |>
mutate(Cash = ifelse(Cash == max(atm_data$Cash, na.rm = T), NA, Cash)) |>
group_by(ATM) |>
mutate(Cash = case_when(
is.na(Cash) & ATM %in% c("ATM1", "ATM4") ~ median(Cash, na.rm=T),
is.na(Cash) & ATM == "ATM2" ~ mode(Cash),
TRUE ~ Cash
  )) |>
  ungroup()
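As a quick sanity check, no missing Cash values should remain after the imputation:
# verify that imputation left no NAs in Cash
sum(is.na(atm_data_imputed$Cash))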
Let’s take a look at the data now that the values have been imputed.
atm_data_imputed |>
autoplot(Cash) +
facet_wrap(~ATM, ncol=1, scales = "free_y") +
labs(title = "ATM Withdrawal for Four ATMs (May 2009 - April 2010)", x = "Date")
atm_data_imputed |>
ACF(Cash) |>
autoplot() +
facet_wrap(~ATM, ncol=1, scales = "free_y") +
  labs(title = "ACF Plots", x = "Lag")
From a preliminary look, it is hard to tell whether the series are stationary. ATM1 and ATM2 show weekly seasonality, apparent from the significant spikes at lags 7, 14, and 21 in their ACF plots, so these series may not be stationary. We can use a KPSS unit root test to check.
atm_data_imputed |>
features(Cash, unitroot_kpss) |>
knitr::kable()
| ATM | kpss_stat | kpss_pvalue |
|---|---|---|
| ATM1 | 0.4651358 | 0.0495190 |
| ATM2 | 1.9660900 | 0.0100000 |
| ATM3 | 0.3891606 | 0.0818273 |
| ATM4 | 0.1181178 | 0.1000000 |
At the 5% significance level, we reject the null hypothesis of stationarity for ATM1 (just barely, p = 0.0495) and ATM2 (p = 0.01), while ATM3 and ATM4 appear stationary.
We can see how many orders of differencing must be done to make the data stationary.
atm_data_imputed |>
features(Cash, unitroot_ndiffs)
## # A tibble: 4 × 2
## ATM ndiffs
## <fct> <int>
## 1 ATM1 1
## 2 ATM2 1
## 3 ATM3 0
## 4 ATM4 0
The test suggests one order of differencing for both ATM1 and ATM2. Given the weekly seasonality seen in the ACF plots, we will apply a seasonal difference (lag 7) and check whether any further differencing is needed.
atm_data_imputed |>
features(difference(Cash,7), unitroot_ndiffs)
## # A tibble: 4 × 2
## ATM ndiffs
## <fct> <int>
## 1 ATM1 0
## 2 ATM2 0
## 3 ATM3 0
## 4 ATM4 0
After the seasonal difference, no further first-order differencing is needed.
atm_data_imputed |>
filter(ATM == "ATM1") |>
gg_tsdisplay(difference(Cash, lag=7),
plot_type="partial") +
labs(title = "ATM1: Seasonally Differenced")
We can see the seasonal lags in the PACF plot. There is one outstanding seasonal spike in the ACF plot at lag 7, suggesting a seasonal MA(1) component. The spike at lag 1 in the ACF plot could also indicate a non-seasonal MA(1).
atm_data_imputed |>
filter(ATM == "ATM2") |>
gg_tsdisplay(difference(Cash, lag=7),
plot_type="partial") +
labs(title = "ATM2: Seasonally Differenced")
Once again, we can see the seasonal spikes in the PACF plot. There is a significant spike in the ACF plot at lag 7, indicating a seasonal MA(1) component. There may also be a significant spike at a later seasonal lag (around lag 21), so we will also try a seasonal MA(2).
atm_data_imputed |>
filter(ATM == "ATM4") |>
gg_tsdisplay(Cash,
plot_type="partial") +
labs(title = "ATM4")
We can see the seasonal spikes decaying in the ACF plot. There is an outstanding seasonal spike at lag 7 in the PACF plot, indicating a seasonal AR(1) component.
We can use seasonal ARIMA models to forecast future withdrawals for each ATM. For each one we will compare an automatically selected ARIMA model against the candidates suggested by the ACF and PACF plots, and later hold out April 2010 as a validation set.
atm1 <- atm_data_imputed |>
filter(ATM == "ATM1")
atm1_mod <- atm1 |>
model(auto = ARIMA(Cash),
arima001011 = ARIMA(Cash ~ 0 + pdq(0,0,1) + PDQ(0,1,1, period=7)))
atm1_mod |>
knitr::kable()
| ATM | auto | arima001011 |
|---|---|---|
| ATM1 | <ARIMA(0,0,1)(0,1,2)[7]> | <ARIMA(0,0,1)(0,1,1)[7]> |
glance(atm1_mod) |>
select(.model:BIC) |>
knitr::kable()
| .model | sigma2 | log_lik | AIC | AICc | BIC |
|---|---|---|---|---|---|
| auto | 559.7744 | -1641.115 | 3290.231 | 3290.344 | 3305.753 |
| arima001011 | 564.7218 | -1643.061 | 3292.121 | 3292.189 | 3303.763 |
atm2 <- atm_data_imputed |>
filter(ATM == "ATM2")
atm2_mod <- atm2 |>
model(auto = ARIMA(Cash),
arima000011 = ARIMA(Cash ~ 0 + pdq(0,0,0) + PDQ(0,1,1, period=7)),
arima000012 = ARIMA(Cash ~ 0 + pdq(0,0,0) + PDQ(0,1,2, period=7)))
atm2_mod |>
knitr::kable()
| ATM | auto | arima000011 | arima000012 |
|---|---|---|---|
| ATM2 | <ARIMA(2,0,2)(0,1,1)[7]> | <ARIMA(0,0,0)(0,1,1)[7]> | <ARIMA(0,0,0)(0,1,2)[7]> |
glance(atm2_mod) |>
select(.model:BIC) |>
knitr::kable()
| .model | sigma2 | log_lik | AIC | AICc | BIC |
|---|---|---|---|---|---|
| auto | 588.0783 | -1649.266 | 3310.531 | 3310.771 | 3333.814 |
| arima000011 | 620.7961 | -1660.382 | 3324.764 | 3324.798 | 3332.525 |
| arima000012 | 622.0567 | -1660.237 | 3326.473 | 3326.541 | 3338.115 |
atm4 <- atm_data_imputed |>
filter(ATM == "ATM4")
atm4_mod <- atm4 |>
model(auto = ARIMA(Cash),
arima000100 = ARIMA(Cash ~ 0 + pdq(0,0,0) + PDQ(1,0,0, period=7)))
atm4_mod |>
knitr::kable()
| ATM | auto | arima000100 |
|---|---|---|
| ATM4 | <ARIMA(3,0,2)(1,0,0)[7] w/ mean> | <ARIMA(0,0,0)(1,0,0)[7]> |
glance(atm4_mod) |>
select(.model:BIC) |>
knitr::kable()
| .model | sigma2 | log_lik | AIC | AICc | BIC |
|---|---|---|---|---|---|
| auto | 117510.3 | -2645.177 | 5306.353 | 5306.758 | 5337.553 |
| arima000100 | 171786.4 | -2719.433 | 5442.865 | 5442.898 | 5450.665 |
In each case, the automatically selected ARIMA model has a lower AIC than the manually specified ones. For ATM1, the manual model's AIC is only slightly higher than the automatic model's, and its BIC is actually lower.
Let’s forecast using the automatically generated models.
atm_train <- atm_data_imputed |>
filter(DATE < "2010-04-01")
atms_mod <- atm_train |>
filter(ATM != "ATM3") |>
model(ARIMA(Cash))
fc <- atms_mod |>
forecast(h=61)
There is not enough history for ATM3 to fit a meaningful model, as only three days of nonzero withdrawals were recorded. For this ATM we will use the NAIVE method, which simply projects the last observed value forward.
atm3 <- atm_data_imputed |>
filter(ATM == "ATM3")
atm3_mod <- atm3 |>
model(NAIVE(Cash))
atm3_fc <- atm3_mod |>
forecast(h=31)
fc <- fc |>
bind_rows(atm3_fc)
fc |>
autoplot(atm_data_imputed)
The models for ATM1 and ATM2 appear to track future withdrawals reasonably well. The ATM4 model includes a mean term and its forecasts are less accurate.
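Since April 2010 was held out, we can also score the point forecasts numerically. This is a sketch using fabletools' accuracy(); the choice of metrics is ours:
# compare the April 2010 forecasts against the held-out actuals
fc |>
  filter(DATE < "2010-05-01") |>
  accuracy(atm_data_imputed) |>
  select(ATM, .model, RMSE, MAE, MAPE) |>
  knitr::kable()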
atms_mod |>
filter(ATM == "ATM1") |>
gg_tsresiduals() +
labs(title = "ATM1")
atms_mod |>
filter(ATM == "ATM2") |>
gg_tsresiduals() +
labs(title = "ATM2")
atms_mod |>
filter(ATM == "ATM4") |>
gg_tsresiduals() +
labs(title = "ATM4")
The residuals of the ATM1 and ATM2 models closely resemble white noise.
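We can back this up with a Ljung-Box test on the innovation residuals. This is a sketch; the choice of lag = 14 (two seasonal periods) is ours:
# large p-values are consistent with white noise residuals
atms_mod |>
  augment() |>
  features(.innov, ljung_box, lag = 14) |>
  knitr::kable()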
may_forecasts <- fc |>
filter(DATE >= "2010-05-01")
wb = createWorkbook()
sh = addWorksheet(wb, "ATM Forecasts")
xl_write(may_forecasts, wb, sh)
saveWorkbook(wb, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_624/Projects/output/project1_forecasts.xlsx", overwrite = TRUE)
Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Let’s load the data.
power_data <- read.csv('https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_624/Projects/data/ResidentialCustomerForecastLoad-624.csv')
# glimpse preliminary information
glimpse(power_data)
## Rows: 192
## Columns: 3
## $ CaseSequence <int> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
## $ YYYY.MMM <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
## $ KWH <int> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…
The data consists of 3 columns.
CaseSequence is an ID column and is unnecessary for the purposes of this analysis, so it can be removed.
The date column (YYYY.MMM) is coded as a character. In order to perform a time series analysis on this data, we will need to parse it as a date.
We will also turn this table into a tsibble.
power_data_ts <- power_data |>
rename(Date = YYYY.MMM) |>
mutate(Date = yearmonth(Date)) |>
select(-CaseSequence) |>
as_tsibble(index=Date)
Let’s check if there are any missing values.
colSums(is.na(power_data_ts))
## Date KWH
## 0 1
There is one missing value for KWH. Let’s see which date
is missing this recorded value.
power_data_ts |>
filter(is.na(KWH)) |>
knitr::kable()
| Date | KWH |
|---|---|
| 2008 Sep | NA |
September 2008 is missing the recorded KWH.
Let’s take a look at the distribution of KWH and the
trend of the data to address this value.
plot1 <- power_data_ts |>
autoplot(KWH) +
labs(title = "Lineplot of Power Usage", x = "Date") +
scale_y_continuous(labels=scales::comma)
plot2 <- power_data_ts |>
ggplot(aes(x = KWH)) +
geom_histogram(bins=15, fill='lightblue') +
scale_x_continuous(labels=scales::comma) +
labs(title = "Histogram of KWH")
plot3 <- power_data_ts |>
ggplot(aes(x = KWH)) +
geom_boxplot(fill='lightblue') +
scale_x_continuous(labels=scales::comma) +
labs(title = "Boxplot of KWH")
plot_grid(plot1, plot_grid(plot2, plot3, nrow=1), ncol=1)
The line plot shows seasonality in the data. It also shows a major outlier sometime in 2010, which is visible in the histogram and boxplot as well. We will need to deal with the missing value as well as this outlier. What is the outlier?
power_data_ts |>
filter(KWH < 3000000)
## # A tsibble: 1 x 2 [1M]
## Date KWH
## <mth> <int>
## 1 2010 Jul 770523
July 2010 is the outlier, with only 770,523 KWH used that month; every other month is in the millions. This could be a data-entry error, perhaps with a missing digit.
# remove outlier
outlier_ind <- which(power_data_ts$Date == yearmonth("2010 Jul"))
power_data_ts$KWH[outlier_ind] <- NA
Let’s take a look at the lags for this variable to see how we should impute the missing values.
power_data_ts |>
gg_lag(KWH, geom="point", lags = 1:12)
Based on the lag plots, observations are most similar at lag 12, i.e., at yearly intervals.
power_data_ts |>
gg_season(KWH) +
scale_y_continuous(labels=scales::comma) +
  labs(title = "Seasonal Plot of KWH", x = "Month")
The data follows a clear seasonal pattern: power usage is higher at the beginning, middle, and end of the year.
To impute the two missing values, we will average the power usage for the same month in the two prior years.
impute_jul <- power_data |>
filter(YYYY.MMM %in% c("2008-Jul", "2009-Jul"))
impute_sep <- power_data |>
filter(YYYY.MMM %in% c("2006-Sep", "2007-Sep"))
# impute
power_data_ts <- power_data_ts |>
mutate(KWH = ifelse(Date == yearmonth("2010 Jul"), mean(impute_jul$KWH), KWH),
KWH = ifelse(Date == yearmonth("2008 Sep"), mean(impute_sep$KWH), KWH))
power_data_ts |>
autoplot(KWH) +
labs(title = "Lineplot of Power Usage", x = "Date") +
scale_y_continuous(labels=scales::comma)
Due to the clear seasonality in the data, we will need to use seasonal differencing to make this data stationary.
power_data_ts |>
gg_tsdisplay(difference(KWH, lag=12),
plot_type="partial") +
labs(title = "Power Usage: Seasonally Differenced")
Let’s see whether a further first-order difference is required after the seasonal differencing.
power_data_ts |>
features(difference(KWH, lag=12), unitroot_kpss) |>
knitr::kable()
| kpss_stat | kpss_pvalue |
|---|---|
| 0.1623246 | 0.1 |
No further differencing is required.
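As a cross-check, unitroot_nsdiffs() should confirm that a single seasonal difference suffices for the raw series (a quick sketch):
# number of seasonal differences suggested for KWH
power_data_ts |>
  features(KWH, unitroot_nsdiffs) |>
  knitr::kable()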
The ACF and PACF plots of the seasonally differenced data are quite similar. Both show an outstanding spike at lag 12, so we can try either a seasonal AR(1) or a seasonal MA(1) component. There is also a possible outstanding non-seasonal spike at lag 3 in both plots, so we will try adding a non-seasonal MA(1) term as well.
Finally, there is possibly a slight upward trend in the data, so we will also test models with drift.
We will use 2012 and 2013 as validation sets.
power_mods <- power_data_ts |>
filter(year(Date) < 2012) |>
model(auto = ARIMA(KWH),
arima000011 = ARIMA(KWH ~ 0 + pdq(0,0,0) + PDQ(0,1,1, period=12)),
arima000110 = ARIMA(KWH ~ 0 + pdq(0,0,0) + PDQ(1,1,0, period=12)),
arima001011 = ARIMA(KWH ~ 0 + pdq(0,0,1) + PDQ(0,1,1, period=12)),
arima001110 = ARIMA(KWH ~ 0 + pdq(0,0,1) + PDQ(1,1,0, period=12)),
arima001011drift = ARIMA(KWH ~ 1 + pdq(0,0,1) + PDQ(0,1,1, period=12)),
arima001110drift = ARIMA(KWH ~ 1 + pdq(0,0,1) + PDQ(1,1,0, period=12)))
power_mods |>
knitr::kable()
| auto | arima000011 | arima000110 | arima001011 | arima001110 | arima001011drift | arima001110drift |
|---|---|---|---|---|---|---|
| <ARIMA(0,0,1)(2,1,0)[12] w/ drift> | <ARIMA(0,0,0)(0,1,1)[12]> | <ARIMA(0,0,0)(1,1,0)[12]> | <ARIMA(0,0,1)(0,1,1)[12]> | <ARIMA(0,0,1)(1,1,0)[12]> | <ARIMA(0,0,1)(0,1,1)[12] w/ drift> | <ARIMA(0,0,1)(1,1,0)[12] w/ drift> |
glance(power_mods) |>
select(.model:BIC) |>
knitr::kable()
| .model | sigma2 | log_lik | AIC | AICc | BIC |
|---|---|---|---|---|---|
| auto | 358748449455 | -2298.737 | 4607.474 | 4607.874 | 4622.723 |
| arima000011 | 473934089563 | -2320.148 | 4644.297 | 4644.375 | 4650.396 |
| arima000110 | 491517415472 | -2322.441 | 4648.882 | 4648.961 | 4654.982 |
| arima001011 | 385874805250 | -2304.657 | 4615.313 | 4615.471 | 4624.463 |
| arima001110 | 429148017595 | -2311.241 | 4628.482 | 4628.640 | 4637.632 |
| arima001011drift | 356469559457 | -2299.458 | 4606.915 | 4607.180 | 4619.115 |
| arima001110drift | 418376486644 | -2308.873 | 4625.747 | 4626.011 | 4637.946 |
The ARIMA(0,0,1)(0,1,1)[12] model with drift has the lowest AIC and BIC.
power_mods |>
forecast(h="3 years") |>
autoplot(power_data_ts |> filter(year(Date) > 2010), level= NULL) +
labs(title = "KWH Forecast Models") +
scale_y_continuous(labels=scales::comma)
Each model follows the data fairly closely, although the two ARIMA(0,0,1)(1,1,0)[12] variants (with and without drift) seem to overestimate in mid-2013.
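The plot is suggestive, but we can also score each model over the 2012-2013 hold-out. This is a sketch using accuracy(); the choice of metrics is ours:
# out-of-sample accuracy over the two validation years
power_mods |>
  forecast(h = "2 years") |>
  accuracy(power_data_ts) |>
  select(.model, RMSE, MAE, MAPE) |>
  arrange(RMSE) |>
  knitr::kable()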
Let’s compare the residuals of the automatically generated model with the ARIMA(0,0,1)(0,1,1)[12] model with drift.
power_mods |>
select(auto) |>
gg_tsresiduals() +
labs(title = "Auto ARIMA")
power_mods |>
select(arima001011drift) |>
gg_tsresiduals() +
labs(title = "ARIMA(0,0,1)(0,1,1)[12] with drift")
The residuals of the ARIMA(0,0,1)(0,1,1)[12] model with drift resemble white noise slightly better than those of the automatically selected model. One lag still appears autocorrelated, but it is much smaller than the autocorrelated lag in the automatic model. The distribution of the residuals looks roughly normal and there is no obvious heteroskedasticity.
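A Ljung-Box test on the chosen model's residuals can confirm this. Again a sketch; the choice of lag = 24 (two seasonal periods) is ours:
# a large p-value indicates residuals indistinguishable from white noise
power_mods |>
  select(arima001011drift) |>
  augment() |>
  features(.innov, ljung_box, lag = 24) |>
  knitr::kable()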
We will forecast for 2014 using the ARIMA(0,0,1)(0,1,1)[12] model with drift.
fc_2014 <- power_mods |>
select(arima001011drift) |>
forecast(h="3 years") |>
filter(year(Date) == 2014)
sh = addWorksheet(wb, "Power Forecasts")
xl_write(fc_2014, wb, sh, append=T)
saveWorkbook(wb, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_624/Projects/output/project1_forecasts.xlsx", overwrite = TRUE)
Part C consists of two data sets. These are simple two-column sets, but they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (for multiple recordings within an hour, take the mean). Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-ahead forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
waterpipe1_data <- read.csv('https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_624/Projects/data/Waterflow_Pipe1.csv')
waterpipe2_data <- read.csv('https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_624/Projects/data/Waterflow_Pipe2.csv')
glimpse(waterpipe1_data)
## Rows: 1,000
## Columns: 2
## $ Date.Time <chr> "10/23/15 12:24 AM", "10/23/15 12:40 AM", "10/23/15 12:53 AM…
## $ WaterFlow <dbl> 23.369599, 28.002881, 23.065895, 29.972809, 5.997953, 15.935…
glimpse(waterpipe2_data)
## Rows: 1,000
## Columns: 2
## $ Date.Time <chr> "10/23/15 1:00 AM", "10/23/15 2:00 AM", "10/23/15 3:00 AM", …
## $ WaterFlow <dbl> 18.810791, 43.087025, 37.987705, 36.120379, 31.851259, 28.23…
First, we need to parse the Date.Time columns into proper datetimes.
waterpipe1_data <- waterpipe1_data |>
mutate(Date.Time = as.POSIXct(Date.Time, format = "%m/%d/%y %I:%M %p"))
waterpipe2_data <- waterpipe2_data |>
mutate(Date.Time = as.POSIXct(Date.Time, format = "%m/%d/%y %I:%M %p"))
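A quick sanity check, since a timestamp that fails to match the format string would silently become NA:
# count timestamps that failed to parse
sum(is.na(waterpipe1_data$Date.Time))
sum(is.na(waterpipe2_data$Date.Time))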
We can now round down to the hour and take an average for each hour.
hourly_pipe1 <- waterpipe1_data |>
mutate(Date.Time = floor_date(Date.Time, "hour")) |>
group_by(Date.Time) |>
summarize(avg_flow = mean(WaterFlow, na.rm=T)) |>
as_tsibble(index=Date.Time)
hourly_pipe2 <- waterpipe2_data |>
mutate(Date.Time = floor_date(Date.Time, "hour")) |>
group_by(Date.Time) |>
summarize(avg_flow = mean(WaterFlow, na.rm=T)) |>
as_tsibble(index=Date.Time)
Let’s look for any trends in the data.
There are some gaps in the data. We will use fill_gaps() to create explicit NA entries for any missing hours and then fill those NAs with the last observation carried forward.
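tsibble's count_gaps() can show how many hours are missing from each series before we fill them (a quick sketch):
# report the runs of missing hours in each hourly series
hourly_pipe1 |> count_gaps()
hourly_pipe2 |> count_gaps()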
plot1 <- hourly_pipe1 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
autoplot(avg_flow) +
labs(title = "Pipe 1: Hourly Water Flow", x = "Date", y = "Average Water Flow")
plot2 <- hourly_pipe1 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
ACF(avg_flow) |>
autoplot() +
labs(title = "ACF")
plot_grid(plot1, plot2, ncol=1)
Pipe1 appears to be stationary: there are no outstanding spikes in the ACF plot and no clear seasonal pattern.
plot1 <- hourly_pipe2 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
autoplot(avg_flow) +
labs(title = "Pipe 2: Hourly Water Flow", x = "Date", y = "Average Water Flow")
plot2 <- hourly_pipe2 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
ACF(avg_flow) |>
autoplot() +
labs(title = "ACF")
plot_grid(plot1, plot2, ncol=1)
Pipe2 also looks stationary, though there are three marginally significant spikes in the ACF plot.
Let’s confirm that both series are stationary using a KPSS unit root test.
hourly_pipe1 |>
features(avg_flow, unitroot_kpss) |>
knitr::kable()
| kpss_stat | kpss_pvalue |
|---|---|
| 0.2336022 | 0.1 |
hourly_pipe2 |>
  features(avg_flow, unitroot_kpss) |>
  knitr::kable()
Both pipe1 and pipe2 are clearly stationary.
We will use automatically generated ARIMA models to forecast.
pipe1_mod <- hourly_pipe1 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
model(ARIMA(avg_flow))
report(pipe1_mod)
## Series: avg_flow
## Model: ARIMA(0,0,0) w/ mean
##
## Coefficients:
## constant
## 19.8504
## s.e. 0.2735
##
## sigma^2 estimated as 18.1: log likelihood=-690.42
## AIC=1384.85 AICc=1384.9 BIC=1391.82
pipe2_mod <- hourly_pipe2 |>
fill_gaps() |>
fill(avg_flow, .direction="down") |>
model(ARIMA(avg_flow))
report(pipe2_mod)
## Series: avg_flow
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean
##
## Coefficients:
## sma1 constant
## 0.0743 39.5820
## s.e. 0.0316 0.5422
##
## sigma^2 estimated as 256.3: log likelihood=-4195.3
## AIC=8396.6 AICc=8396.63 BIC=8411.33
For pipe1, the model selected is equivalent to a MEAN model (white noise around a constant). For pipe2, there is a seasonal MA(1) component in addition to the mean.
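As a sanity check on the pipe1 result, fitting fable's explicit MEAN() model on the same gap-filled series should give essentially the same fit (a sketch):
# an explicit mean model for pipe1, for comparison with the auto ARIMA
hourly_pipe1 |>
  fill_gaps() |>
  fill(avg_flow, .direction = "down") |>
  model(MEAN(avg_flow)) |>
  report()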
Let’s forecast for the next week using these models.
pipe1_fc <- pipe1_mod |>
forecast(h="1 week")
pipe1_fc |>
autoplot(hourly_pipe1) +
labs(title = "Pipe 1 Week Forecast", y = "Average Water Flow", x = "Date")
pipe2_fc <- pipe2_mod |>
forecast(h="1 week")
pipe2_fc |>
autoplot(hourly_pipe2) +
labs(title = "Pipe 2 Week Forecast", y = "Average Water Flow", x = "Date")
The pipe1 forecast is simply the series mean. The pipe2 forecast starts with a small seasonal fluctuation over the first 24 hours (from the seasonal MA(1) term) and then settles at the mean.
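Before saving the forecasts, a quick Ljung-Box check on both models' residuals (a sketch; the choice of lag = 24 is ours):
# large p-values are consistent with adequately specified models
bind_rows(
  augment(pipe1_mod) |> features(.innov, ljung_box, lag = 24),
  augment(pipe2_mod) |> features(.innov, ljung_box, lag = 24)
)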
sh = addWorksheet(wb, "Pipe1 Forecasts")
xl_write(pipe1_fc, wb, sh, append=T)
saveWorkbook(wb, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_624/Projects/output/project1_forecasts.xlsx", overwrite = TRUE)
sh = addWorksheet(wb, "Pipe2 Forecasts")
xl_write(pipe2_fc, wb, sh, append=T)
saveWorkbook(wb, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_624/Projects/output/project1_forecasts.xlsx", overwrite = TRUE)