Project 1

Part A – ATM Forecast

Forecast how much cash is taken out of four different ATMs for May 2010. The variable Cash is provided in hundreds of dollars; other than that it is straightforward. Explain and demonstrate your process, techniques used and not used, and your actual forecast.

First examine the data. It is a tibble with a DATE field stored as a double, an ATM field stored as character, and a Cash field stored as a double. DATE should be transformed into a true date type, and the distribution of the data across each of the four ATMs should be examined.

(ATM <- readxl::read_xlsx("C:/data/ATM624Data.xlsx"))
## # A tibble: 1,474 x 3
##     DATE ATM    Cash
##    <dbl> <chr> <dbl>
##  1 39934 ATM1     96
##  2 39934 ATM2    107
##  3 39935 ATM1     82
##  4 39935 ATM2     89
##  5 39936 ATM1     85
##  6 39936 ATM2     90
##  7 39937 ATM1     90
##  8 39937 ATM2     55
##  9 39938 ATM1     99
## 10 39938 ATM2     79
## # ... with 1,464 more rows

There is quite a bit of cleanup to do. Creating a new dataframe from the imported data: set a new day variable by converting DATE to a date with origin 1899-12-30 (the Excel serial-date origin), pivot the table wider so that each ATM has its own column of Cash withdrawn from it, remove the original DATE column and the pivoted NA column, then filter to only dates prior to May 1, 2010. Lastly, create a tsibble from the newly transformed data.

ATMs <- ATM %>%
  # convert the numeric date to a true Date (Excel serial-date origin)
  mutate(tsDay = as.Date(DATE, origin = "1899-12-30")) %>%
  # pivot the table from long to wide
  pivot_wider(names_from = ATM, values_from = Cash) %>%
  # remove the original numeric DATE and the pivoted 'NA' variables
  select(-DATE, -"NA") %>%
  # filter the data to not include May 2010, which will be forecast
  filter(tsDay < "2010-05-01") %>%
  # set the data into a tsibble for time series analysis
  as_tsibble(index = tsDay)

head(ATMs)
## # A tsibble: 6 x 5 [1D]
##   tsDay       ATM1  ATM2  ATM3  ATM4
##   <date>     <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01    96   107     0 777. 
## 2 2009-05-02    82    89     0 524. 
## 3 2009-05-03    85    90     0 793. 
## 4 2009-05-04    90    55     0 908. 
## 5 2009-05-05    99    79     0  52.8
## 6 2009-05-06    88    19     0  52.2

The data looks more like a time series now, but checking the summary statistics there are still some issues: NA values in ATM1 and ATM2 and a large outlier in ATM4.

summary(ATMs)
##      tsDay                 ATM1             ATM2             ATM3        
##  Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000  
##  1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: 0.0000  
##  Median :2009-10-30   Median : 91.00   Median : 67.00   Median : 0.0000  
##  Mean   :2009-10-30   Mean   : 83.89   Mean   : 62.58   Mean   : 0.7206  
##  3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000  
##  Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :96.0000  
##                       NA's   :3        NA's   :2                         
##       ATM4          
##  Min.   :    1.563  
##  1st Qu.:  124.334  
##  Median :  403.839  
##  Mean   :  474.043  
##  3rd Qu.:  704.507  
##  Max.   :10919.762  
## 

Since there are only 3 missing values for ATM1 and 2 for ATM2, it is best to impute a value for each respective ATM in order to preserve the structural characteristics of each machine's original data. After the imputation it is good to re-check the summary statistics. To find the best method for imputing the missing values, we need to look at the distributions of the ATM1 and ATM2 data.

par(mfrow = c(2, 2))
hist(ATMs$ATM1, col = "lightgreen",  main = "", xlab = "ATM1", breaks = 10)
hist(ATMs$ATM2, col = "lightblue",   main = "", xlab = "ATM2", breaks = 10)
hist(ATMs$ATM3, col = "lightyellow", main = "", xlab = "ATM3", breaks = 10)
hist(ATMs$ATM4, col = "lightpink",   main = "", xlab = "ATM4", breaks = 10)

ATM1 is mostly normally distributed except for a large number of days with very small totals, which actually makes it bimodal. ATM2 is closer to a uniform distribution. ATM3 has practically no transactions at all; the statistics above confirm that its median and 3rd quartile are $0, while its mean of 0.7206 (about $72, since Cash is in hundreds) comes from only 3 non-zero days. Lastly, ATM4 has large outliers that heavily right-skew the x-axis scale.

Since neither of the two ATMs with missing values is perfectly normally distributed, it is best to impute with the median. This removes the NA values, as seen below, and barely moves the distributions: ATM1's mean shifts from 83.89 to 83.95, ATM2's 1st quartile moves from 25.5 to 26 and its mean from 62.58 to 62.6, while the medians, 3rd quartiles, and maxima are unchanged.

ATMs <- ATMs %>%
  # replace the NAs in ATM1 and ATM2 with each series' own median
  mutate(ATM1 = ifelse(is.na(ATM1), median(ATM1, na.rm = TRUE), ATM1)) %>%
  mutate(ATM2 = ifelse(is.na(ATM2), median(ATM2, na.rm = TRUE), ATM2))

summary(ATMs)
##      tsDay                 ATM1             ATM2            ATM3        
##  Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.0   Min.   : 0.0000  
##  1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 26.0   1st Qu.: 0.0000  
##  Median :2009-10-30   Median : 91.00   Median : 67.0   Median : 0.0000  
##  Mean   :2009-10-30   Mean   : 83.95   Mean   : 62.6   Mean   : 0.7206  
##  3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.0   3rd Qu.: 0.0000  
##  Max.   :2010-04-30   Max.   :180.00   Max.   :147.0   Max.   :96.0000  
##       ATM4          
##  Min.   :    1.563  
##  1st Qu.:  124.334  
##  Median :  403.839  
##  Mean   :  474.043  
##  3rd Qu.:  704.507  
##  Max.   :10919.762

Let’s plot the series for each ATM to see if the time series for each ATM reflects its respective distribution.
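
Before the individual analyses, a quick faceted overview of all four series side by side is helpful. This is a small sketch, not part of the original code, that stacks the four ATM columns with pivot_longer so autoplot can facet them; the column names "ATM" and "Cash" are only for the plot.

# sketch: all four ATM series in one faceted plot
ATMs %>%
  pivot_longer(-tsDay, names_to = "ATM", values_to = "Cash") %>%
  autoplot(Cash) +
  facet_wrap(~ ATM, ncol = 1, scales = "free_y") +
  labs(title = "Daily ATM withdrawals", y = "Cash (H)")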

ATM1 Analysis

ATM1 has no clear trend but strong, regular seasonality. Looking at the STL decomposition and its residuals will help clarify exactly what is happening.

ATM1 <- ATMs %>%
  select(tsDay,ATM1)

ATM1 %>%
  autoplot(ATM1) +
  labs(title = "ATM1", y = "Cash (H)")

The decomposition confirms that there is no clear upward or downward trend and that there is regular seasonality. What seems odd is that the seasonality is strong enough that some of it remains in the remainder component over the last quarter.

ATM1_decomp <- ATM1 %>%
    model(STL(ATM1 ~ trend(window = 24) +
                season(window = "periodic"),
          robust = TRUE))
 
ATM1_decomp  %>%
  components() %>%
    autoplot()

Next, examine the residuals.

The gg_tsresiduals plot shows that seasonality remains in the residuals over the last three months or so, and the ACF confirms significant periodicity at the weekly lags, so the residuals are not white noise. The pattern is consistent with paychecks being deposited weekly and with two days per week seeing more withdrawals than the others. Perhaps a large employer in the area switched to direct deposit in the last quarter? But enough conjecture; back to the analysis.

ATM1_decomp %>%
  gg_tsresiduals()

ATM1 Forecast

Given that there is not much trend but a lot of seasonality, the SNAIVE method may be the best way to forecast the month of May for ATM1. We can see that the day-of-week seasonality is preserved, but the forecast admits the possibility of negative activity, which makes no sense for an ATM, and the prediction interval is very wide.

ATM1 %>%
  model(SNAIVE(ATM1)) %>%
    forecast(h = 31) %>%
      autoplot(ATM1) +
      labs(y = "Cash (H)", title = "1-month forecast of ATM1 activity")

Checking Box-Cox for transformation options, a lambda of 0.3406 is closest to 0.5, which implies that a square root transform is in order.

lambda <- ATM1 %>%
  features(ATM1, features = guerrero) %>%
  pull(lambda_guerrero)

ATM1 %>%
autoplot(box_cox(ATM1,lambda)) +
  labs(y = "",
       title = paste0("Transformed ATM1 activity with lambda = ",round(lambda,4)))

Applying the SNAIVE method to a square-root-transformed ATM1 series re-forecasts the month of May for ATM1 with no negative predictions. The day-of-week seasonality is still preserved, and the forecast now shows a slight upward drift over the next four weeks (or 31 days), still with a very wide prediction interval, though not as wide as the prior attempt.

ATM1 %>%
  model(SNAIVE(sqrt(ATM1))) %>%
    forecast(h = 31) %>%
      autoplot(ATM1) +
      labs(y = "Cash (H)", title = "1-month forecast of ATM1 activity")

ATM2 Analysis

Onward to ATM2! Here is the time series plot.

ATM2 has a somewhat decreasing overall trend and extreme seasonality similar to ATM1. Again, looking at the STL decomposition and its residuals would be very enlightening.

ATM2 <- ATMs %>%
  select(tsDay,ATM2)

ATM2 %>%
  autoplot(ATM2) +
  labs(title = "ATM2", y = "Cash (H)")

The decomposition confirms the downward trend in this case: slowly declining for three quarters and then dropping sharply at the beginning of the last quarter. There is again regular seasonality as with ATM1, and seeing the same seasonal pattern in the most recent remainder lends further weight to the hypothetical local employer's switch to direct deposit.

ATM2_decomp <- ATM2 %>%
    model(STL(ATM2 ~ trend(window = 24) +
                season(window = "periodic"),
          robust = TRUE))
 
ATM2_decomp  %>%
  components() %>%
    autoplot()

The residuals for ATM2 show that seasonality is still present in the last quarter of activity, and the ACF confirms significant periodicity at the weekly lags. However, in contrast to ATM1, which had many small transactions, ATM2 sees larger positive transactions. Perhaps management banks at ATM2?

ATM2_decomp %>%
  gg_tsresiduals()

In any event, given both a declining trend and a lot of seasonality, the drift method may be the best way to forecast the month of May for ATM2. But given the large dip in the trend, it would be wise to look at Box-Cox transformation options before forecasting the data.

A lambda of 0.6798 is a little higher than 0.5, which implies that a square root transform is in order.

lambda <- ATM2 %>%
  features(ATM2, features = guerrero) %>%
  pull(lambda_guerrero)

ATM2 %>%
autoplot(box_cox(ATM2,lambda)) +
  labs(y = "",
       title = paste0("Transformed ATM2 activity with lambda = ",round(lambda,4)))

ATM2 Forecast

Applying the drift method (a random walk with drift) to a square-root-transformed ATM2 series forecasts the month of May for ATM2 with a significant upward trend. This looks wrong, so maybe the apparent trend was misread. Let's try the seasonal naive method instead, since the seasonality is so much stronger than the weak apparent trend.

ATM2 %>%
  model(RW(sqrt(ATM2) ~ drift())) %>%
    forecast(h = 31) %>%
      autoplot(ATM2) +
      labs(y = "Cash (H)", title = "1-month forecast of ATM2 activity")

This is much better! Using SNAIVE preserves the day-of-week seasonality, and the forecast still shows a slight upward drift over the next four weeks (or 31 days), again with a very wide prediction interval, but it is far more believable than the RW drift attempt.

ATM2 %>%
  model(SNAIVE(sqrt(ATM2))) %>%
    forecast(h = 31) %>%
      autoplot(ATM2) +
      labs(y = "Cash (H)", title = "1-month forecast of ATM2 activity")
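
To put a number behind the visual judgment, the same accuracy() comparison used later in Part B can be applied here. This is a minimal sketch, not part of the original analysis, comparing the in-sample RMSE of the two candidate ATM2 models; the model names "drift" and "snaive" are just labels.

# sketch: training RMSE for the two ATM2 candidates (lower is better)
ATM2 %>%
  model(
    drift  = RW(sqrt(ATM2) ~ drift()),
    snaive = SNAIVE(sqrt(ATM2))
  ) %>%
  accuracy() %>%
  select(.model, RMSE)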

ATM3 Analysis

Here is the raw data for ATM3. To be honest, since there are only 3 non-zero observations in the entire series, there is nothing more to do than make a NAIVE forecast and move on to ATM4. Imputation can't help when there are only 3 values in an entire year.

ATM3 <- ATMs %>%
  select(tsDay,ATM3)

ATM3 %>%
  autoplot(ATM3) +
  labs(title = "ATM3", y = "Cash (H)")

ATM3 Forecast

This forecast makes me think of a satellite dish, or perhaps R2D2’s comms antenna? Beep boop squeee beep boop! (Hey, even if there’s not much to do with it, you still gotta have fun.)

ATM3 %>%
  model(NAIVE(ATM3)) %>%
    forecast(h = 31) %>%
      autoplot(ATM3) +
      labs(y = "Cash (H)", title = "1-month forecast of ATM3 activity")

ATM4 Analysis

The ATM4 time series is plotted last, and there might possibly be an outlier somewhere in this plot (maybe?). Actually, the summary statistics above already showed that the maximum daily value for ATM4 is 10,919.76, which, with Cash in hundreds of dollars, is roughly $1.09 million, so we already know there is a very high value in the data. It should be examined more closely.

ATM4 <- ATMs %>%
  select(tsDay,ATM4)

ATM4 %>%
  autoplot(ATM4) +
  labs(title = "ATM4", y = "Cash (H)")

Checking Box-Cox transformation options, a lambda of -0.1229 is closest to 0, which suggests a log transform. A log transform would help to at least reduce the outlier (if there was one) and it would also affect all of the other values similarly.

lambda <- ATM4 %>%
  features(ATM4, features = guerrero) %>%
  pull(lambda_guerrero)

ATM4 %>%
autoplot(box_cox(ATM4,lambda)) +
  labs(y = "",
       title = paste0("Transformed ATM4 activity with lambda = ",round(lambda,4)))

Let us pause to revisit the data for ATM4 and see exactly how many outliers there are and how large they are.

(top5 <- ATM4 %>%
  arrange(desc(ATM4)) %>%
  slice(1:5))
## # A tsibble: 5 x 2 [1D]
##   tsDay        ATM4
##   <date>      <dbl>
## 1 2010-02-09 10920.
## 2 2009-09-22  1712.
## 3 2010-01-29  1575.
## 4 2009-09-04  1495.
## 5 2009-06-09  1484.

There is only one extreme outlier among the top 5 daily totals for ATM4 over the past year, and it is more than six times the size of the second-largest value. If this were not a time series exercise, the value could simply be removed, but it is an influential part of the overall structure of the ATM4 data, and removing an entire day would break the time series. So to remediate the outlier in this instance, it is better to set the largest value equal to the second-largest value, which rescales the outlier without significantly changing the distribution of the ATM4 data.

Here is the distribution before any modifications.

summary(ATM4)
##      tsDay                 ATM4          
##  Min.   :2009-05-01   Min.   :    1.563  
##  1st Qu.:2009-07-31   1st Qu.:  124.334  
##  Median :2009-10-30   Median :  403.839  
##  Mean   :2009-10-30   Mean   :  474.043  
##  3rd Qu.:2010-01-29   3rd Qu.:  704.507  
##  Max.   :2010-04-30   Max.   :10919.762

Now replace the maximum value with the second-highest value.

highval1 <- ATM4 %>% filter(tsDay == "2010-02-09")  # get the highest value row
highval2 <- ATM4 %>% filter(tsDay == "2009-09-22")  # get the second highest value row

ATM4$ATM4[ATM4$tsDay == highval1$tsDay] <- highval2$ATM4  # replace the highest value with the second highest value

Verify that the former maximum is now equal to the second-highest value. We see that it is.

(top5 <- ATM4 %>%
  arrange(desc(ATM4)) %>%
  slice(1:5))
## # A tsibble: 5 x 2 [1D]
##   tsDay       ATM4
##   <date>     <dbl>
## 1 2009-09-22 1712.
## 2 2010-02-09 1712.
## 3 2010-01-29 1575.
## 4 2009-09-04 1495.
## 5 2009-06-09 1484.

Re-examine the distribution for ATM4. The Mean and the Max decreased, bringing the Mean closer to the Median, but the Median and the rest of the values are unaffected.

summary(ATM4)
##      tsDay                 ATM4         
##  Min.   :2009-05-01   Min.   :   1.563  
##  1st Qu.:2009-07-31   1st Qu.: 124.334  
##  Median :2009-10-30   Median : 403.839  
##  Mean   :2009-10-30   Mean   : 448.817  
##  3rd Qu.:2010-01-29   3rd Qu.: 704.507  
##  Max.   :2010-04-30   Max.   :1712.075

Re-examine the time series plot as well. The outlier has been successfully capped, so now we can see more clearly that there is no apparent trend and not even much apparent seasonality, very little of which was visible before.

ATM4 %>%
  autoplot(ATM4) +
  labs(title = "ATM4", y = "Cash (H)")

Last, rechecking Box-Cox transformation options after addressing the outlier, the lambda is now 0.3965, which is closest to 0.5 and suggests a square-root transform. This is a considerably different recommendation from the log transform obtained with the outlier present.

lambda <- ATM4 %>%
  features(ATM4, features = guerrero) %>%
  pull(lambda_guerrero)

ATM4 %>%
autoplot(box_cox(ATM4,lambda)) +
  labs(y = "",
       title = paste0("Transformed ATM4 activity with lambda = ",round(lambda,4)))

Moving on to the STL decomposition with a square-root transformation applied, there is clearly no consistent trend, but there is in fact a lot of seasonality that was not visible until the decomposition. This suggests a SNAIVE forecast should work well, as with ATM1.

ATM4_decomp <- ATM4 %>%
    model(STL(sqrt(ATM4) ~ trend(window = 24) +
                season(window = "periodic"),
          robust = TRUE))
 
ATM4_decomp  %>%
  components() %>%
    autoplot()

ATM4 Forecast

The SNAIVE forecast preserves the seasonality, and we also see a somewhat increasing trend projected into the next month. Perhaps this is in line with what one would expect from the ATM that sees the largest outlying withdrawals from time to time?

ATM4 %>%
  model(SNAIVE(sqrt(ATM4))) %>%
    forecast(h = 31) %>%
      autoplot(ATM4) +
      labs(y = "Cash (H)", title = "1-month forecast of ATM4 activity")
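
Since the deliverable for Part A is the actual May 2010 forecast rather than just the plots, here is a minimal sketch that collects the 31 daily point forecasts for each machine into one table, using the model choices settled on above (SNAIVE on the square-root-transformed series for ATM1, ATM2, and ATM4, and NAIVE for ATM3). The object and column names are illustrative, and all values remain in hundreds of dollars.

# sketch: daily point forecasts for May 2010, one column per ATM (hundreds of dollars)
May_fc <- bind_rows(
  ATM1 %>% model(SNAIVE(sqrt(ATM1))) %>% forecast(h = 31) %>%
    as_tibble() %>% transmute(tsDay, ATM = "ATM1", Cash = .mean),
  ATM2 %>% model(SNAIVE(sqrt(ATM2))) %>% forecast(h = 31) %>%
    as_tibble() %>% transmute(tsDay, ATM = "ATM2", Cash = .mean),
  ATM3 %>% model(NAIVE(ATM3)) %>% forecast(h = 31) %>%
    as_tibble() %>% transmute(tsDay, ATM = "ATM3", Cash = .mean),
  ATM4 %>% model(SNAIVE(sqrt(ATM4))) %>% forecast(h = 31) %>%
    as_tibble() %>% transmute(tsDay, ATM = "ATM4", Cash = .mean)
) %>%
  pivot_wider(names_from = ATM, values_from = Cash)

May_fc                  # 31 rows of daily point forecasts
colSums(May_fc[, -1])   # total forecast cash per ATM for May (hundreds of dollars)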

Part B – Forecasting Power

Power Analysis

Use a simple dataset of residential power usage from January 1998 through December 2013 to model the data and produce a monthly forecast for 2014. The variable KWH is power consumption in kilowatt-hours; the rest is straightforward.

First examine the data. It is again a tibble, with a CaseSequence variable (double) that appears to be a sequencing index, a YYYY-MMM character field, and KWH (double). It needs to be transformed into a tsibble for time series analysis.

(ResPower <- readxl::read_xlsx("C:/data/ResidentialCustomerForecastLoad-624.xlsx"))
## # A tibble: 192 x 3
##    CaseSequence `YYYY-MMM`     KWH
##           <dbl> <chr>        <dbl>
##  1          733 1998-Jan   6862583
##  2          734 1998-Feb   5838198
##  3          735 1998-Mar   5420658
##  4          736 1998-Apr   5010364
##  5          737 1998-May   4665377
##  6          738 1998-Jun   6467147
##  7          739 1998-Jul   8914755
##  8          740 1998-Aug   8607428
##  9          741 1998-Sep   6989888
## 10          742 1998-Oct   6345620
## # ... with 182 more rows

Look at the existing distribution of the original data. There is only one NA value in the KWH, which is not bad.

summary(ResPower)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

The distribution looks somewhat normal, so the one NA could be addressed by imputing the mean of the distribution.

hist(ResPower$KWH)

The NA record in KWH is for September 2008. With only 1 missing value, imputing the mean is acceptable and will not disturb the original distribution.

(KWH_NA <- subset(ResPower, is.na(ResPower$KWH)))
## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`   KWH
##          <dbl> <chr>      <dbl>
## 1          861 2008-Sep      NA

Reformat the data to create a tsibble with an index of Month and impute the 1 NA record in KWH to the mean. Re-checking the shape of the distribution against the original, the mean imputation did not alter the fundamental distribution.

KWH_ts <- ResPower %>%
  mutate(Month = yearmonth(`YYYY-MMM`)) %>%
  select(-CaseSequence, -`YYYY-MMM`) %>%
  mutate(KWH = ifelse(is.na(KWH), mean(KWH, na.rm = TRUE), KWH)) %>%   # impute the single NA in KWH to the mean
  as_tsibble(index = Month)

hist(KWH_ts$KWH)

Looking at the time series plot there is an outlier in mid 2010, and overall there appears to be seasonality and a minor upward trend in electricity use. It may be best to use exponential smoothing (ETS) to forecast this data.

KWH_ts %>%
  autoplot(KWH) +
  labs(y = "KWH", title = "Residential Power Usage 1998 - 2013")

Power Forecast

The next task is finding the right ETS model. Checking the additive and multiplicative ETS models, it is difficult to tell visually which forecasts better given the historical seasonality and trend.

fitKWH <- KWH_ts %>%
  model(
    additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
    multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M"))
  )

fcKWH <- fitKWH %>% forecast(h = 12)

fcKWH %>%
  autoplot(KWH_ts, level = NULL) +
  labs(title = "Residential Power Usage 1998 - 2013 with 2014 ETS forecast", y = "KWH(M)") +
  facet_grid(.model ~.) +
  guides(color = guide_legend(title="Forecast"))

Also attempting damped ETS forecasts with \(\phi = 0.8\) and \(\phi = 0.95\) respectively, these look similar both to each other and also to the multiplicative model. Of course that makes sense since damping only impacts trend, which is minimal in the original data anyway, and the seasonality in the damped models is set to multiplicative just like the multiplicative model.

fitKWH_d <- KWH_ts %>%
  model(
    damped_0.8 = ETS(KWH ~ error("M") + trend("Ad",phi=0.8) + season("M")),
    damped_0.95 = ETS(KWH ~ error("M") + trend("Ad",phi=0.95) + season("M"))
  )

fcKWH_d <- fitKWH_d %>% forecast(h = 12)

fcKWH_d %>%
  autoplot(KWH_ts, level = NULL) +
  labs(title = "Residential Power Usage 1998 - 2013 with 2014 damped forecasts", y = "KWH(M)") +
  facet_grid(.model ~.) +
  guides(color = guide_legend(title="Forecast"))

Overall, though, they all look so similar that it is necessary to check the RMSE for every model. The comparison indicates that the additive model is best, since it has the lowest RMSE, though not by a great deal.

fitKWH_A <- KWH_ts %>%
  model(
    additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
    multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M")),
    damped_0.8 = ETS(KWH ~ error("M") + trend("Ad",phi=0.8) + season("M")),
    damped_0.95 = ETS(KWH ~ error("M") + trend("Ad",phi=0.95) + season("M"))
  )


fitKWH_A %>%
  accuracy() %>%
  select(.model,RMSE)
## # A tibble: 4 x 2
##   .model            RMSE
##   <chr>            <dbl>
## 1 additive       826419.
## 2 multiplicative 829670.
## 3 damped_0.8     831984.
## 4 damped_0.95    837110.

Overall then I would go with the additive model as shown below.

fitKWH_Ad <- KWH_ts %>%
  model(
    additive = ETS(KWH ~ error("A") + trend("A") + season("A"))
  )

fcKWH_Ad <- fitKWH_Ad %>% forecast(h = 12)

fcKWH_Ad %>%
  autoplot(KWH_ts, level = NULL) +
  labs(title = "Residential Power Usage 1998 - 2013 with 2014 ETS forecast", y = "KWH(M)") +
  facet_grid(.model ~.) +
  guides(color = guide_legend(title="Forecast"))

Validating the residuals for white noise: a couple of lags on the ACF plot fall just outside the significance bounds, and the residual histogram looks roughly normal with a small number of outliers, so overall the residuals look like white noise.

(fitKWH_Ad_rs <- fitKWH_Ad %>%
    gg_tsresiduals())
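
To complement the visual check with a formal one, a Ljung-Box test on the innovation residuals should return a large p-value if they are white noise. This is a small sketch, not part of the original analysis; lag = 24 (two seasonal periods for monthly data) is an assumed, conventional choice.

# sketch: Ljung-Box test on the additive ETS innovation residuals
fitKWH_Ad %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24)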

Part C – Multiple Waterflows

Waterflow Analyses

These are two simple two-column datasets; however, they have different time stamps. Time-base sequence the data and aggregate it by hour; for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a one-week-forward forecast.

First examine the data.

Water1 <- readxl::read_xlsx("C:/data/Waterflow_Pipe1.xlsx")
Water2 <- readxl::read_xlsx("C:/data/Waterflow_Pipe2.xlsx")

head(Water1,100)
## # A tibble: 100 x 2
##    `Date Time` WaterFlow
##          <dbl>     <dbl>
##  1      42300.     23.4 
##  2      42300.     28.0 
##  3      42300.     23.1 
##  4      42300.     30.0 
##  5      42300.      6.00
##  6      42300.     15.9 
##  7      42300.     26.6 
##  8      42300.     33.3 
##  9      42300.     12.4 
## 10      42300.     21.8 
## # ... with 90 more rows

Transform the dates to day and hour, then aggregate each flow to the hour by taking the mean of each hourly group. Once both datasets are grouped to the hour and the means taken, combine them into one tibble, pivot wide, and convert to a tsibble.

# transform Flow 1 (convertToDateTime() is from the openxlsx package)
Flow1 <- Water1 %>%
  mutate(tsDate = strptime(convertToDateTime(`Date Time`), format = "%Y-%m-%d %H", tz = "EST")) %>%
  mutate(tsDay = format(tsDate, format = "%Y-%m-%d")) %>%
  mutate(tsHour = format(tsDate, format = "%H:%M:%S")) %>%
  mutate(Flow = "Flow1") %>%
  group_by(tsDay, tsHour) %>%
  mutate(WaterFlow = mean(WaterFlow)) %>%
  select(-`Date Time`, -tsDate) %>%
  distinct()

# transform Flow 2
Flow2 <- Water2 %>%
  mutate(tsDate = strptime(convertToDateTime(`Date Time`), format = "%Y-%m-%d %H", tz = "EST")) %>%
  mutate(tsDay = format(tsDate, format = "%Y-%m-%d")) %>%
  mutate(tsHour = format(tsDate, format = "%H:%M:%S")) %>%
  mutate(Flow = "Flow2") %>%
  group_by(tsDay, tsHour) %>%
  mutate(WaterFlow = mean(WaterFlow)) %>%
  select(-`Date Time`, -tsDate) %>%
  distinct()

Flows <- as.data.frame(rbind(Flow1, Flow2)) %>%
  mutate(tsDayHour = as_datetime(format(paste(tsDay, tsHour), format = "%Y-%m-%d %H", tz = "EST"))) %>%
  select(tsDayHour, Flow, WaterFlow) %>%
  pivot_wider(names_from = Flow, values_from = WaterFlow) %>%
  as_tsibble(index = tsDayHour)

Having time-base sequenced the data and aggregated it by hour, let us compare what is in Flow 1 to what is in Flow 2. Flow 1 has no values after midnight on 2015-11-01, while Flow 2 continues to the end of the dataset at 2015-12-03 16:00:00 and is missing only the first value in the series.

autoplot(Flows, .vars = Flow1) + labs(title = "Flow 1")

autoplot(Flows, .vars = Flow2) + labs(title = "Flow 2")
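
The claim about where Flow 1 stops can be verified directly. This is a quick sketch, not in the original analysis, that reports the last non-missing timestamp for each series; the output column names are illustrative.

# sketch: last hour with an observed (non-NA) value in each flow
Flows %>%
  as_tibble() %>%
  summarise(
    last_Flow1 = max(tsDayHour[!is.na(Flow1)]),
    last_Flow2 = max(tsDayHour[!is.na(Flow2)])
  )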

Flow 1 is in serious need of imputation, while Flow 2 needs imputation for only the first value in the series.

summary(Flows)
##    tsDayHour                       Flow1            Flow2       
##  Min.   :2015-10-23 00:00:00   Min.   : 8.923   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:00:00   1st Qu.:17.033   1st Qu.:28.140  
##  Median :2015-11-12 20:00:00   Median :19.784   Median :39.682  
##  Mean   :2015-11-12 20:00:00   Mean   :19.893   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:00:00   3rd Qu.:22.790   3rd Qu.:50.782  
##  Max.   :2015-12-03 16:00:00   Max.   :31.730   Max.   :78.303  
##                                NA's   :765      NA's   :1

Looking at the distributions, both Flow 1 and Flow 2 are roughly normally distributed, which means the mean could be used for imputation of each respective flow.

par(mfrow=c(1,2))
hist(Flows$Flow1)
hist(Flows$Flow2)

After imputing the mean for Flow 1, we re-check the distribution and the plot of the data. The imputation for Flow 1 has failed, since the 1st quartile, median, mean, and 3rd quartile are now all exactly the same. There were too many missing values for mean imputation to work.

Flows_1 <- Flows %>%
  mutate(Flow1 = ifelse(is.na(Flow1), mean(Flow1, na.rm = TRUE), Flow1)) %>%   # impute NA values to the mean of Flow1
  as_tsibble(index = tsDayHour)

summary(Flows_1)
##    tsDayHour                       Flow1            Flow2       
##  Min.   :2015-10-23 00:00:00   Min.   : 8.923   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:00:00   1st Qu.:19.893   1st Qu.:28.140  
##  Median :2015-11-12 20:00:00   Median :19.893   Median :39.682  
##  Mean   :2015-11-12 20:00:00   Mean   :19.893   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:00:00   3rd Qu.:19.893   3rd Qu.:50.782  
##  Max.   :2015-12-03 16:00:00   Max.   :31.730   Max.   :78.303  
##                                                 NA's   :1

And the histogram also not-so-subtly indicates that the imputation failed.

hist(Flows_1$Flow1)

Plotting the updated, imputed Flow 1, we can see that the net effect of imputing the mean for so many missing values is effectively a MEAN forecast: all values after the last observed value flat-line at the mean.

autoplot(Flows_1, .vars = Flow1)

After imputing the mean for Flow 2, we recheck the distribution and the plot of the data. The imputation for Flow 2 succeeded since the distribution was only very slightly modified.

Flows_2 <- Flows %>%
  mutate(Flow2 = ifelse(is.na(Flow2), mean(Flow2, na.rm = TRUE), Flow2)) %>%   # impute the single NA to the mean of Flow2
  as_tsibble(index = tsDayHour)

summary(Flows_2)
##    tsDayHour                       Flow1            Flow2       
##  Min.   :2015-10-23 00:00:00   Min.   : 8.923   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:00:00   1st Qu.:17.033   1st Qu.:28.172  
##  Median :2015-11-12 20:00:00   Median :19.784   Median :39.648  
##  Mean   :2015-11-12 20:00:00   Mean   :19.893   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:00:00   3rd Qu.:22.790   3rd Qu.:50.779  
##  Max.   :2015-12-03 16:00:00   Max.   :31.730   Max.   :78.303  
##                                NA's   :765

And the histogram is also essentially unchanged, indicating that the imputation succeeded.

hist(Flows_2$Flow2)

Plotting the updated, imputed Flow 2, the plot is unchanged except for the very first value on the left that now has a value.

autoplot(Flows_2, .vars = Flow2)

Since Flow 1 was not imputed well, there is no sense in checking it for stationarity, but Flow 2 is a viable candidate for forecasting, so we can check it. The ACF has a few lags outside the critical bounds, but perhaps few enough to consider the series stationary. Further tests are required to confirm this.

Flows_2 %>% gg_tsdisplay(Flow2)

The series appears to be stationary based on the unitroot_kpss test: the reported p-value of 0.1 means we fail to reject the null hypothesis of stationarity, so we can try to forecast it.

Flows_2 %>% features(Flow2, unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.105         0.1

Additionally, the unitroot_ndiffs feature confirms that no differencing is needed to make the series stationary, so we will proceed to forecast it one week forward.

Flows_2 %>% features(Flow2, unitroot_ndiffs)
## # A tibble: 1 x 1
##   ndiffs
##    <int>
## 1      0

Waterflow Forecast

Flow 2 ARIMA forecast

Fitting several ARIMA candidates lets the automatic procedures determine which model forecasts best. The stepwise and search procedures return the same model, which has the lowest \(AIC_c\) of the candidates.

fit_ARIMA <- Flows_2 %>%
  model(arima110 = ARIMA(Flow2 ~ pdq(1,1,0)),     # AR model
        arima011 = ARIMA(Flow2 ~ pdq(0,1,1)),     # MA model
        stepwise = ARIMA(Flow2),                  # default stepwise
        search = ARIMA(Flow2, stepwise = FALSE)   # search the model space
        )

report(fit_ARIMA)
## # A tibble: 3 x 8
##   .model   sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots  
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>    
## 1 arima110   380.  -4388. 8782. 8782. 8797. <cpl [1]> <cpl [24]>
## 2 stepwise   256.  -4194. 8394. 8394. 8409. <cpl [0]> <cpl [24]>
## 3 search     256.  -4194. 8394. 8394. 8409. <cpl [0]> <cpl [24]>

Checking the residuals of the search model (which matches the stepwise selection), the residuals look centered around a mean of 0. They also look like white noise based on the ACF, and they appear to be normally distributed.

fit_ARIMA %>% select(search) %>% gg_tsresiduals(lag = 168)

The one-week hourly forecast follows the shape of the data for the first 24 hours, but after the first day it flat-lines near the series mean, much like a simple mean forecast.

forecast(fit_ARIMA, h=168) %>%
  filter(.model=="search") %>%
  autoplot(Flows_2) +
  labs(y = "Flow2 ", title = "Flow 2 one-week hourly forecast search model")
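
Finally, to report the one-week forecast itself rather than only the plot, here is a minimal sketch that pulls the 168 hourly point forecasts from the search model and rolls them up to daily means; the names used (Day, mean_hourly_Flow2) are illustrative.

# sketch: hourly point forecasts from the search model, summarised to daily means
forecast(fit_ARIMA, h = 168) %>%
  filter(.model == "search") %>%
  as_tibble() %>%
  mutate(Day = as.Date(tsDayHour)) %>%
  group_by(Day) %>%
  summarise(mean_hourly_Flow2 = mean(.mean))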