Forecast how much cash is taken out of 4 different ATMs for
May 2010. The variable Cash is provided in hundreds of
dollars; other than that it is straightforward. Explain and demonstrate
your process, techniques used and not used, and your actual
forecast.
First examine the data. It is a tibble with a DATE field of
double data type, an ATM field of character type, and a Cash
field of double type. DATE should be transformed into a true date type, and
the distribution of the data across each of the 4 ATMs should be
examined.
(ATM <- readxl::read_xlsx("C:/data/ATM624Data.xlsx"))
## # A tibble: 1,474 x 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
## 7 39937 ATM1 90
## 8 39937 ATM2 55
## 9 39938 ATM1 99
## 10 39938 ATM2 79
## # ... with 1,464 more rows
There is quite a bit of cleanup to do. Creating a new dataframe from
the imported data: set a new day variable by converting the Excel
serial DATE to a true date (origin 1899-12-30),
pivot the table wider so that each ATM has its own variable for the Cash
amount withdrawn from it, remove the original DATE and the
pivoted NA fields, then filter for only dates prior to May
1, 2010. Lastly, create a tsibble from the newly transformed data.
ATMs <- ATM %>%
# convert the numeric date to a true Date
mutate(tsDay = as.Date(DATE, origin = "1899-12-30")) %>%
# pivot the table from long to wide
pivot_wider(names_from = ATM, values_from = Cash) %>%
# remove the original numeric DATE and the pivoted 'NA' variables
select(-DATE,-"NA") %>%
# filter the data to not include May 2010, which will be forecast
filter(tsDay < "2010-05-01") %>%
# set the data into a tsibble for time series analysis
as_tsibble(index=tsDay)
head(ATMs)
## # A tsibble: 6 x 5 [1D]
## tsDay ATM1 ATM2 ATM3 ATM4
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01 96 107 0 777.
## 2 2009-05-02 82 89 0 524.
## 3 2009-05-03 85 90 0 793.
## 4 2009-05-04 90 55 0 908.
## 5 2009-05-05 99 79 0 52.8
## 6 2009-05-06 88 19 0 52.2
The data is looking more like a time series now, but the summary
statistics show there are still some issues: NA values for
ATM1 and ATM2 and a large outlier in
ATM4.
summary(ATMs)
## tsDay ATM1 ATM2 ATM3
## Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000
## Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 0.0000
## Mean :2009-10-30 Mean : 83.89 Mean : 62.58 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :96.0000
## NA's :3 NA's :2
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
##
Since there are only 3 missing values for ATM1 and 2 for
ATM2, it is best to impute a value for each respective ATM
in order to preserve the structural characteristics of the original data
for each machine. After the imputation it is good to re-check the
summary statistics. To find the best method to impute the
missing values, we need to look at the distributions of the ATM1 and
ATM2 data.
par(mfrow=c(2,2))
hist(ATMs$ATM1, col = "lightgreen", main = "", xlab = "ATM1", breaks = 10)
hist(ATMs$ATM2, col = "lightblue", main = "", xlab = "ATM2", breaks = 10)
hist(ATMs$ATM3, col = "lightyellow", main = "", xlab = "ATM3", breaks = 10)
hist(ATMs$ATM4, col = "lightpink", main = "", xlab = "ATM4", breaks = 10)
ATM1 is mostly normally distributed except for a large
number of very small transaction days, which actually makes it bimodal.
ATM2 is closer to being uniformly distributed.
ATM3 has practically no transactions at all; the
statistics above confirm that its Median and 3rd Quartile are 0,
while its Mean of 0.72 (about $72, since Cash is in hundreds) comes from
only 3 non-zero days. Lastly,
ATM4 has a large outlier that severely right-skews the distribution
and stretches the scale of the x-axis.
Since neither of the two ATMs with missing values is completely normally distributed, it is best to impute with the median value. This removes the NA values as seen below while barely disturbing the distributions: ATM1's mean nudges from 83.89 to 83.95, ATM2's 1st quartile moves from 25.5 to 26, and the medians, 3rd quartiles, and maxima are unchanged.
ATMs <- ATMs %>%
mutate(ATM1 = ifelse(is.na(ATM1), median(ATM1, na.rm = TRUE), ATM1)) %>%
mutate(ATM2 = ifelse(is.na(ATM2), median(ATM2, na.rm = TRUE), ATM2))
summary(ATMs)
## tsDay ATM1 ATM2 ATM3
## Min. :2009-05-01 Min. : 1.00 Min. : 0.0 Min. : 0.0000
## 1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 26.0 1st Qu.: 0.0000
## Median :2009-10-30 Median : 91.00 Median : 67.0 Median : 0.0000
## Mean :2009-10-30 Mean : 83.95 Mean : 62.6 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.0 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :180.00 Max. :147.0 Max. :96.0000
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
Let’s plot the series for each ATM to see if the time series for each ATM reflects its respective distribution.
ATM1 has no clear trend but extreme seasonality. Looking at the STL decomposition and its residuals will help clarify exactly what is happening in it.
ATM1 <- ATMs %>%
select(tsDay,ATM1)
ATM1 %>%
autoplot(ATM1) +
labs(title = "ATM1", y = "Cash (H)")
The decomposition confirms that there is no clear upward or downward trend, and there is regular seasonality. What looks odd is that the seasonality is strong enough that some of it is left over in the remainder for the most recent quarter.
ATM1_decomp <- ATM1 %>%
model(STL(ATM1 ~ trend(window = 24) +
season(window = "periodic"),
robust = TRUE))
ATM1_decomp %>%
components() %>%
autoplot()
Next, the residuals.
The gg_tsresiduals plot shows that there is seasonality
in the residuals over the last three months or so, and the
acf plot confirms significant periodicity in
the lags rather than stationarity, such as paychecks arriving weekly
on the positive side and two days per week that see more withdrawals
than other days. Perhaps a large employer in the area switched to direct
deposits in the last quarter? But enough with conjecture…back to the
analysis.
ATM1_decomp %>%
gg_tsresiduals()
Given that there is not much trend but there is a lot of seasonality, the SNAIVE method may be the best way to forecast the month of May for ATM1. The day-of-week seasonality is preserved, but the forecast allows for negative activity, which makes no sense for an ATM, and the prediction intervals are very wide.
ATM1 %>%
model(SNAIVE(ATM1)) %>%
forecast(h = 31) %>%
autoplot(ATM1) +
labs(y = "Cash (H)", title = "1-month forecast of ATM1 activity")
Checking Box-Cox transformation options, the Guerrero-selected lambda of 0.3406 is closest to 0.5, which implies that a square root transform is in order.
lambda <- ATM1 %>%
features(ATM1, features = guerrero) %>%
pull(lambda_guerrero)
ATM1 %>%
autoplot(box_cox(ATM1,lambda)) +
labs(y = "",
title = paste0("Transformed ATM1 activity with lambda = ",round(lambda,4)))
Applying the SNAIVE method to a square-root-transformed ATM1 series re-forecasts the month of May for ATM1 with no negative predictions. The day-of-week seasonality is still preserved, the forecast appears to drift slightly upward over the next 31 days, and the prediction intervals, while still very wide, are not as wide as in the prior attempt.
ATM1 %>%
model(SNAIVE(sqrt(ATM1))) %>%
forecast(h = 31) %>%
autoplot(ATM1) +
labs(y = "Cash (H)", title = "1-month forecast of ATM1 activity")
Onward to ATM2! Here is the time series plot.
ATM2 has a somewhat decreasing overall trend and extreme seasonality similar to ATM1. Again, looking at the STL decomposition and its residuals would be very enlightening.
ATM2 <- ATMs %>%
select(tsDay,ATM2)
ATM2 %>%
autoplot(ATM2) +
labs(title = "ATM2", y = "Cash (H)")
The decomposition confirms the downward trend in this case: a slow decline for three quarters and then a dive at the beginning of the last quarter. There is again regular seasonality as with ATM1, and seeing the same seasonality in the most recent remainder lends more weight to the conjecture about a local employer moving to direct deposit.
ATM2_decomp <- ATM2 %>%
model(STL(ATM2 ~ trend(window = 24) +
season(window = "periodic"),
robust = TRUE))
ATM2_decomp %>%
components() %>%
autoplot()
The residuals for ATM2 show that there is seasonality present in the
last quarter of activity, and the acf confirms that there
is significant periodicity in the lags. However, in contrast to ATM1,
which had many small transactions, ATM2 sees larger positive
transactions. Perhaps management banks at ATM2?
ATM2_decomp %>%
gg_tsresiduals()
In any event, given that there is both a declining trend and a lot of seasonality, the drift method may be the best way to forecast the month of May for ATM2. But given the large dip in the trend, it would be wise to look at Box-Cox transformation options before trying to forecast the data.
A lambda of 0.6798 is a little higher than 0.5, which still implies that a square root transform is in order.
lambda <- ATM2 %>%
features(ATM2, features = guerrero) %>%
pull(lambda_guerrero)
ATM2 %>%
autoplot(box_cox(ATM2,lambda)) +
labs(y = "",
title = paste0("Transformed ATM2 activity with lambda = ",round(lambda,4)))
Applying the drift method (a random walk with drift) to a square-root-transformed ATM2 series forecasts the month of May for ATM2 with a significant upward trend. This looks wrong, so maybe I was wrong about the trend? Let's try forecasting with the Seasonal Naive method instead, since the seasonality was so much stronger than the weak apparent trend.
ATM2 %>%
model(RW(sqrt(ATM2) ~ drift())) %>%
forecast(h = 31) %>%
autoplot(ATM2) +
labs(y = "Cash (H)", title = "1-month forecast of ATM2 activity")
This is much better! Using SNAIVE preserves the day-of-week seasonality, and the forecast still shows a slight upward drift over the next 31 days, again with very wide prediction intervals, but it is far more believable than the random-walk-with-drift attempt.
ATM2 %>%
model(SNAIVE(sqrt(ATM2))) %>%
forecast(h = 31) %>%
autoplot(ATM2) +
labs(y = "Cash (H)", title = "1-month forecast of ATM2 activity")
Here is the raw data for ATM3. To be honest, since there are only 3 non-zero observations in the data set, there is nothing more to do than make a NAIVE forecast and move on to ATM4. Imputation can't help when there are only 3 values in an entire year.
ATM3 <- ATMs %>%
select(tsDay,ATM3)
ATM3 %>%
autoplot(ATM3) +
labs(title = "ATM3", y = "Cash (H)")
This forecast makes me think of a satellite dish, or perhaps R2D2’s comms antenna? Beep boop squeee beep boop! (Hey, even if there’s not much to do with it, you still gotta have fun.)
ATM3 %>%
model(NAIVE(ATM3)) %>%
forecast(h = 31) %>%
autoplot(ATM3) +
labs(y = "Cash (H)", title = "1-month forecast of ATM3 activity")
The ATM4 time series is plotted last, and there might possibly be an outlier somewhere in this plot (maybe?). Actually, we already verified above that the maximum value for ATM4 is 10,919.76 (in hundreds of dollars), so we have already confirmed that there is a very high value in the data. It should be examined more closely.
ATM4 <- ATMs %>%
select(tsDay,ATM4)
ATM4 %>%
autoplot(ATM4) +
labs(title = "ATM4", y = "Cash (H)")
Checking Box-Cox transformation options, a lambda of -0.1229 is closest to 0, which suggests a log transform. A log transform would at least reduce the influence of the outlier (if there was one) while affecting all of the other values similarly.
lambda <- ATM4 %>%
features(ATM4, features = guerrero) %>%
pull(lambda_guerrero)
ATM4 %>%
autoplot(box_cox(ATM4,lambda)) +
labs(y = "",
title = paste0("Transformed ATM4 activity with lambda = ",round(lambda,4)))
Let us pause to revisit the data for ATM4 and see exactly how many outliers there are and how large they are.
(top5 <- ATM4 %>%
arrange(desc(ATM4)) %>%
slice(1:5))
## # A tsibble: 5 x 2 [1D]
## tsDay ATM4
## <date> <dbl>
## 1 2010-02-09 10920.
## 2 2009-09-22 1712.
## 3 2010-01-29 1575.
## 4 2009-09-04 1495.
## 5 2009-06-09 1484.
There is only one extreme outlier among the top 5 daily totals for ATM4 over the past year, and it is more than six times the size of the second-largest value. If this were not a time series exercise, it might be possible to remove the value entirely, but it is an influential part of the overall structure of the ATM4 data, and removing an entire day would break the time series. So to remediate the outlier in this instance, it is probably best to set the largest value equal to the second-largest value (effectively winsorizing it), which rescales the outlier without significantly impacting the distribution of the data for ATM4.
Here is the distribution before any modifications.
summary(ATM4)
## tsDay ATM4
## Min. :2009-05-01 Min. : 1.563
## 1st Qu.:2009-07-31 1st Qu.: 124.334
## Median :2009-10-30 Median : 403.839
## Mean :2009-10-30 Mean : 474.043
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507
## Max. :2010-04-30 Max. :10919.762
Now replace the Max value with the second-highest value.
highval1 <- ATM4 %>% filter(tsDay == "2010-02-09") # get the highest value row
highval2 <- ATM4 %>% filter(tsDay == "2009-09-22") # get the second highest value row
ATM4$ATM4[ATM4$tsDay == highval1$tsDay] <- highval2$ATM4 # replace the highest value with the second highest value
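For reference, the same replacement could be done without hard-coding the dates. This is only an alternative sketch; running it after the replacement above is a no-op, since the maximum already equals the second-largest value.
second_highest <- sort(ATM4$ATM4, decreasing = TRUE)[2]  # second-largest daily total
ATM4$ATM4[which.max(ATM4$ATM4)] <- second_highest        # cap the maximum at that value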
Verify that the highest value is now set equal to the second-highest value. We see that it is.
(top5 <- ATM4 %>%
arrange(desc(ATM4)) %>%
slice(1:5))
## # A tsibble: 5 x 2 [1D]
## tsDay ATM4
## <date> <dbl>
## 1 2009-09-22 1712.
## 2 2010-02-09 1712.
## 3 2010-01-29 1575.
## 4 2009-09-04 1495.
## 5 2009-06-09 1484.
Re-examine the distribution for ATM4. The Mean and the Max decreased, bringing the Mean closer to the Median, but the Median and the rest of the values are unaffected.
summary(ATM4)
## tsDay ATM4
## Min. :2009-05-01 Min. : 1.563
## 1st Qu.:2009-07-31 1st Qu.: 124.334
## Median :2009-10-30 Median : 403.839
## Mean :2009-10-30 Mean : 448.817
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507
## Max. :2010-04-30 Max. :1712.075
Re-examine the time series plot as well. The outlier has been successfully tamed, and now we can more clearly see that there is no apparent trend and not even a lot of apparent seasonality, very little of which was visible before.
ATM4 %>%
autoplot(ATM4) +
labs(title = "ATM4", y = "Cash (H)")
Last, rechecking Box-Cox transformation options after addressing the one outlier, the lambda is now 0.3965, which is closest to 0.5 and suggests a square root transform. This is a considerably different recommendation from the log transform suggested when the outlier was still present.
lambda <- ATM4 %>%
features(ATM4, features = guerrero) %>%
pull(lambda_guerrero)
ATM4 %>%
autoplot(box_cox(ATM4,lambda)) +
labs(y = "",
title = paste0("Transformed ATM4 activity with lambda = ",round(lambda,4)))
Moving on to the STL decomposition with a sqrt transformation applied, there is clearly no consistent trend, but there is in fact a lot of seasonality that was not visible until the decomposition was performed. This suggests the SNAIVE forecast would work well, similar to ATM1.
ATM4_decomp <- ATM4 %>%
model(STL(sqrt(ATM4) ~ trend(window = 24) +
season(window = "periodic"),
robust = TRUE))
ATM4_decomp %>%
components() %>%
autoplot()
The SNAIVE forecast preserves the seasonality, and we also see a somewhat increasing trend projected into the next month. Perhaps this is in line with what one would expect from the ATM that sees the largest outlying withdrawals from time to time?
ATM4 %>%
model(SNAIVE(sqrt(ATM4))) %>%
forecast(h = 31) %>%
autoplot(ATM4) +
labs(y = "Cash (H)", title = "1-month forecast of ATM4 activity")
Use a simple dataset of residential power usage from January 1998
through December 2013 to model the data and produce a monthly forecast for 2014.
The variable KWH is power consumption in kilowatt hours;
the rest is straightforward.
First examine the data. It is again a tibble, with a
CaseSequence variable of double data type that appears
to be a sequencing index, a YYYY-MMM field of
character data type, and a KWH field of double type. It is necessary
to transform this into a tsibble structure for time series analysis.
(ResPower <- readxl::read_xlsx("C:/data/ResidentialCustomerForecastLoad-624.xlsx"))
## # A tibble: 192 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
## 7 739 1998-Jul 8914755
## 8 740 1998-Aug 8607428
## 9 741 1998-Sep 6989888
## 10 742 1998-Oct 6345620
## # ... with 182 more rows
Look at the distribution of the original data. There is only one NA value in KWH, which is not bad.
summary(ResPower)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
The distribution looks somewhat normal, so the one NA could be addressed by imputing the mean of the distribution.
hist(ResPower$KWH)
The NA record in KWH is for the month of Sep, 2008. With only 1 missing value it will be acceptable to impute it to the mean so as not to disturb the original distribution.
(KWH_NA <- subset(ResPower, is.na(ResPower$KWH)))
## # A tibble: 1 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 861 2008-Sep NA
Reformat the data to create a tsibble with an index
of Month, and impute the 1 NA record in KWH to the mean.
Verifying the shape of the distribution against the original, the mean
imputation did not alter the fundamental distribution.
KWH_ts <- ResPower %>%
mutate(Month = yearmonth(`YYYY-MMM`)) %>%
select(-CaseSequence, -'YYYY-MMM') %>%
mutate(KWH = ifelse(is.na(KWH), mean(KWH, na.rm = TRUE), KWH)) %>% # impute NA value to mean in KWH
as_tsibble(index = Month)
hist(KWH_ts$KWH)
Looking at the time series plot there is an outlier in mid 2010, and overall there appears to be seasonality and a minor upward trend in electricity use. It may be best to use exponential smoothing (ETS) to forecast this data.
KWH_ts %>%
autoplot(KWH) +
labs(y = "KWH", title = "Residential Power Usage 1998 - 2013")
The next task is finding the right ETS model. Checking the additive and multiplicative ETS models, it is difficult to tell visually which better captures the historical seasonality and trend.
fitKWH <- KWH_ts %>%
model(
additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M"))
)
fcKWH <- fitKWH %>% forecast(h = 12)
fcKWH %>%
autoplot(KWH_ts, level = NULL) +
labs(title = "Residential Power Usage 1998 - 2013 with 2014 ETC forecast", y = "KWH(M)") +
facet_grid(.model ~.) +
guides(color = guide_legend(title="Forecast"))
Also attempting damped ETS forecasts with \(\phi = 0.8\) and \(\phi = 0.95\) respectively, these look similar both to each other and to the multiplicative model. That makes sense, since damping only impacts the trend, which is minimal in the original data anyway, and the seasonality in the damped models is multiplicative just like in the multiplicative model.
fitKWH_d <- KWH_ts %>%
model(
damped_0.8 = ETS(KWH ~ error("M") + trend("Ad",phi=0.8) + season("M")),
damped_0.95 = ETS(KWH ~ error("M") + trend("Ad",phi=0.95) + season("M"))
)
fcKWH_d <- fitKWH_d %>% forecast(h = 12)
fcKWH_d %>%
autoplot(KWH_ts, level = NULL) +
labs(title = "Residential Power Usage 1998 - 2013 with 2014 damped forecasts", y = "KWH(M)") +
facet_grid(.model ~.) +
guides(color = guide_legend(title="Forecast"))
Overall they look so similar that it is necessary to check the RMSE for all of the models, which indicates that the additive model is best since it has the lowest RMSE, though not by much.
fitKWH_A <- KWH_ts %>%
model(
additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M")),
damped_0.8 = ETS(KWH ~ error("M") + trend("Ad",phi=0.8) + season("M")),
damped_0.95 = ETS(KWH ~ error("M") + trend("Ad",phi=0.95) + season("M"))
)
fitKWH_A %>%
accuracy() %>%
select(.model,RMSE)
## # A tibble: 4 x 2
## .model RMSE
## <chr> <dbl>
## 1 additive 826419.
## 2 multiplicative 829670.
## 3 damped_0.8 831984.
## 4 damped_0.95 837110.
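A stronger, though optional, comparison would be out-of-sample: hold out the last 12 months, refit the two closest contenders, and score their 2013 forecasts against the actuals. A sketch of that check (not part of the original analysis; KWH_train is an illustrative name):
# withhold 2013, refit, and score the 12-month forecasts
KWH_train <- KWH_ts %>% filter(Month < yearmonth("2013 Jan"))
KWH_train %>%
  model(
    additive       = ETS(KWH ~ error("A") + trend("A") + season("A")),
    multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M"))
  ) %>%
  forecast(h = 12) %>%
  accuracy(KWH_ts) %>%
  select(.model, RMSE, MAE)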
Overall, then, I would go with the additive model, as shown below.
fitKWH_Ad <- KWH_ts %>%
model(
additive = ETS(KWH ~ error("A") + trend("A") + season("A"))
)
fcKWH_Ad <- fitKWH_Ad %>% forecast(h = 12)
fcKWH_Ad %>%
autoplot(KWH_ts, level = NULL) +
labs(title = "Residential Power Usage 1998 - 2013 with 2014 ETS forecast", y = "KWH(M)") +
facet_grid(.model ~.) +
guides(color = guide_legend(title="Forecast"))
Validating the residuals for white noise, there are two lags on the ACF plot that fall just outside the significance bounds (i.e., they hold at least minor significance). The residual histogram looks approximately normal with a small number of outliers, so overall the residuals look like white noise.
(fitKWH_Ad_rs <- fitKWH_Ad %>%
gg_tsresiduals())
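One further check that could be added here is a Ljung-Box test on the innovation residuals of the additive model; a large p-value would be consistent with the white-noise conclusion. A minimal sketch, assuming the same fable/feasts setup used above:
# Ljung-Box test on the ETS innovation residuals
fitKWH_Ad %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24)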
These are two simple two-column datasets; however, they have different time stamps. Time-base sequence the data and aggregate it by hour; for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a one-week-forward forecast.
First examine the data.
Water1 <- readxl::read_xlsx("C:/data/Waterflow_Pipe1.xlsx")
Water2 <- readxl::read_xlsx("C:/data/Waterflow_Pipe2.xlsx")
head(Water1,100)
## # A tibble: 100 x 2
## `Date Time` WaterFlow
## <dbl> <dbl>
## 1 42300. 23.4
## 2 42300. 28.0
## 3 42300. 23.1
## 4 42300. 30.0
## 5 42300. 6.00
## 6 42300. 15.9
## 7 42300. 26.6
## 8 42300. 33.3
## 9 42300. 12.4
## 10 42300. 21.8
## # ... with 90 more rows
Transform the dates to day and hour, then aggregate each flow to the hour by taking the mean of each hourly group. Once both datasets are grouped to the hour and the means are taken, combine them into one tibble, pivot wider, and convert to a tsibble.
# transform Flow 1
Flow1 <- Water1 %>%
mutate(tsDate = strptime(convertToDateTime(`Date Time`), format = "%Y-%m-%d %H", tz = "EST")) %>%
mutate(tsDay = format(tsDate, format = "%Y-%m-%d")) %>%
mutate(tsHour = format(tsDate, format = "%H:%M:%S")) %>%
mutate(Flow = "Flow1") %>%
group_by(tsDay,tsHour) %>%
mutate(WaterFlow = mean(WaterFlow)) %>%
select(-`Date Time`, -tsDate) %>%
distinct()
# transform Flow 2
Flow2 <- Water2 %>%
mutate(tsDate = strptime(convertToDateTime(`Date Time`), format = "%Y-%m-%d %H", tz = "EST")) %>%
mutate(tsDay = format(tsDate, format = "%Y-%m-%d")) %>%
mutate(tsHour = format(tsDate, format = "%H:%M:%S")) %>%
mutate(Flow = "Flow2") %>%
group_by(tsDay,tsHour) %>%
mutate(WaterFlow = mean(WaterFlow)) %>%
select(-`Date Time`, -tsDate) %>%
distinct()
Flows <- as.data.frame(rbind(Flow1,Flow2)) %>%
mutate(tsDayHour = as_datetime(format(paste(tsDay,tsHour), format = "%Y-%m-%d %H", tz = "EST"))) %>%
select(tsDayHour,Flow,WaterFlow) %>%
pivot_wider(names_from = Flow, values_from = WaterFlow) %>%
as_tsibble(index = tsDayHour)
Having time-base sequenced the data and aggregated it based on hour, let us compare what is in Flow 1 to what is in Flow 2. Flow 1 has no values after midnight on 2015-11-01, while Flow 2 continues to the end of the dataset at 2015-12-03 16:00:00 and is only missing the first value in the series.
autoplot(Flows, .vars = Flow1) + labs(title = "Flow 1")
autoplot(Flows, .vars = Flow2) + labs(title = "Flow 2")
Flow 1 is in serious need of imputation, while Flow 2 needs imputation for only the first value in the series.
summary(Flows)
## tsDayHour Flow1 Flow2
## Min. :2015-10-23 00:00:00 Min. : 8.923 Min. : 1.885
## 1st Qu.:2015-11-02 10:00:00 1st Qu.:17.033 1st Qu.:28.140
## Median :2015-11-12 20:00:00 Median :19.784 Median :39.682
## Mean :2015-11-12 20:00:00 Mean :19.893 Mean :39.556
## 3rd Qu.:2015-11-23 06:00:00 3rd Qu.:22.790 3rd Qu.:50.782
## Max. :2015-12-03 16:00:00 Max. :31.730 Max. :78.303
## NA's :765 NA's :1
Looking at the distributions, both Flow 1 and Flow 2 appear normally distributed, which means the mean could be used for imputation of each respective flow.
par(mfrow=c(1,2))
hist(Flows$Flow1)
hist(Flows$Flow2)
After imputing the mean for Flow 1, we recheck the distribution and the plot of the data. The imputation for Flow 1 has failed, since the 1st Quartile, Median, Mean, and 3rd Quartile are all now exactly the same. There were too many missing values for mean imputation to work.
Flows_1 <- Flows %>%
mutate(Flow1 = ifelse(is.na(Flow1), mean(Flow1, na.rm = TRUE), Flow1)) %>% # impute NA values in Flow1 to the mean
as_tsibble(index = tsDayHour)
summary(Flows_1)
## tsDayHour Flow1 Flow2
## Min. :2015-10-23 00:00:00 Min. : 8.923 Min. : 1.885
## 1st Qu.:2015-11-02 10:00:00 1st Qu.:19.893 1st Qu.:28.140
## Median :2015-11-12 20:00:00 Median :19.893 Median :39.682
## Mean :2015-11-12 20:00:00 Mean :19.893 Mean :39.556
## 3rd Qu.:2015-11-23 06:00:00 3rd Qu.:19.893 3rd Qu.:50.782
## Max. :2015-12-03 16:00:00 Max. :31.730 Max. :78.303
## NA's :1
And the histogram is not-so-subtly indicating that the imputation failed as well.
hist(Flows_1$Flow1)
Plotting the updated, imputed Flow 1, we can see that the net effect of imputing so many missing values with the mean is essentially a MEAN forecast: all values after the last original observation flat-line at the series mean.
autoplot(Flows_1, .vars = Flow1)
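An alternative worth noting (not pursued here) is to skip imputation for Flow 1 entirely and restrict it to its observed span, treating it as a shorter series; a sketch, with Flow1_obs as an illustrative name:
# keep only the hours where Flow 1 was actually observed
Flow1_obs <- Flows %>%
  filter(!is.na(Flow1)) %>%
  select(tsDayHour, Flow1)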
After imputing the mean for Flow 2, we recheck the distribution and the plot of the data. The imputation for Flow 2 succeeded since the distribution was only very slightly modified.
Flows_2 <- Flows %>%
mutate(Flow2 = ifelse(is.na(Flow2), mean(Flow2, na.rm = TRUE), Flow2)) %>% # impute the NA value in Flow2 to the mean
as_tsibble(index = tsDayHour)
summary(Flows_2)
## tsDayHour Flow1 Flow2
## Min. :2015-10-23 00:00:00 Min. : 8.923 Min. : 1.885
## 1st Qu.:2015-11-02 10:00:00 1st Qu.:17.033 1st Qu.:28.172
## Median :2015-11-12 20:00:00 Median :19.784 Median :39.648
## Mean :2015-11-12 20:00:00 Mean :19.893 Mean :39.556
## 3rd Qu.:2015-11-23 06:00:00 3rd Qu.:22.790 3rd Qu.:50.779
## Max. :2015-12-03 16:00:00 Max. :31.730 Max. :78.303
## NA's :765
And the histogram is also essentially unchanged, indicating that the imputation succeeded.
hist(Flows_2$Flow2)
Plotting the updated, imputed Flow 2, the plot is unchanged except for the very first value on the left that now has a value.
autoplot(Flows_2, .vars = Flow2)
Since Flow 1 was not imputed well, there is no sense in checking it for stationarity, but Flow 2 is a viable candidate for forecasting, so we can check it. The acf has a few lags outside the critical bounds, but perhaps few enough to consider the series stationary. Further tests are required to confirm this.
Flows_2 %>% gg_tsdisplay(Flow2)
It appears that the series is stationary based on the
unitroot_kpss test: the KPSS null hypothesis is stationarity, and with a
p-value of 0.1 (the largest value the test reports) we fail to reject it,
so we can try to forecast the series.
Flows_2 %>% features(Flow2, unitroot_kpss)
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.105 0.1
Additionally, the unitroot_ndiffs feature confirms that no
differencing is needed to make the series stationary, so we will proceed
to forecast it one week forward.
Flows_2 %>% features(Flow2, unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 0
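A related optional check is whether seasonal differencing would be needed. A sketch using unitroot_nsdiffs, assuming the seasonal period is inferred from the hourly index:
Flows_2 %>% features(Flow2, unitroot_nsdiffs)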
The ARIMA() function can search for the model that would forecast best. It appears that the stepwise and full-search fits return the same model with the lowest \(AIC_c\).
fit_ARIMA <- Flows_2 %>%
model(arima110 = ARIMA(Flow2 ~ pdq(1,1,0)), # AR model
arima011 = ARIMA(Flow2 ~ pdq(0,1,1)), # MA model
stepwise = ARIMA(Flow2), # default stepwise
search = ARIMA(Flow2, stepwise = FALSE) # search the model space
)
report(fit_ARIMA)
## # A tibble: 3 x 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 arima110 380. -4388. 8782. 8782. 8797. <cpl [1]> <cpl [24]>
## 2 stepwise 256. -4194. 8394. 8394. 8409. <cpl [0]> <cpl [24]>
## 3 search 256. -4194. 8394. 8394. 8409. <cpl [0]> <cpl [24]>
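To see the exact specification that the automatic search settled on, the chosen model can be reported on its own; this small follow-up was not shown in the original output.
# report only the 'search' model to see its selected terms and coefficients
fit_ARIMA %>%
  select(search) %>%
  report()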
Checking the residuals of the search model (identical to the stepwise selection), the residuals look centered around a mean of 0. They also look like white noise based on the ACF, and they appear to be normally distributed.
fit_ARIMA %>% select(search) %>% gg_tsresiduals(lag = 168)
The one-week hourly forecast follows the shape of the data for roughly the first 24 hours, but after the first day it flattens out around the series mean.
forecast(fit_ARIMA, h=168) %>%
filter(.model=="search") %>%
autoplot(Flows_2) +
labs(y = "Flow2 ", title = "Flow 2 one-week hourly forecast search model")