DATA 624: Project 1

Author

Ana Collado

This project consists of 3 parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, Oct 25. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.


Part A: ATM Forecast

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010.  The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.

load packages
library(fpp3)
library(tidyverse)
library(naniar)
library(forecast)
library(feasts)
library(zoo)
library(urca)

The Setup:

Before we embark on forecasting, I’ll take a look at the data to determine a course of action. Immediately upon importing, I’ll clean the data, since there are several errors. First, I’ll use clean_names() for uniformity.

atm_df<- readxl::read_xlsx("ATM624Data.xlsx")
atm_df<- janitor::clean_names(atm_df)
atm_df$atm<- as.factor(atm_df$atm)

the dates

The date column is stored as numbers rather than valid dates. Upon researching, I learned that Excel may store dates as “serial numbers”. This would be easily fixed by selecting Format in Excel, but we are working in R. To see the date range (and confirm the range given in the Part A description), I opened the file in Excel and changed the serial format. The dates range from 2009 to 2010.
Screenshot of Serial Formatted Dates

Converting the dates with the regular as_date() shifts them incorrectly to 2079–2080. I don’t believe I should simply assume 2079 is meant to be 2009 and 2080 is 2010, so I dug a little deeper and learned that these serial numbers are used across the board to support mathematical operations on dates.1

The base function as.Date() can be used for this conversion because it lets us specify the origin. Without digressing into Microsoft lore, Excel assigns December 30, 1899 as the origin.

corrected dates
atm_df$date<- as.Date.numeric(atm_df$date, origin = "1899-12-30")
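As a quick sanity check, the converted dates should now run from 1 May 2009 through 14 May 2010, matching both what Excel showed and the date sequence used further down:

#-- Confirm the converted date range
range(atm_df$date)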

the missing data

The next issue is dates with missing data points. We can’t always assume NA data points are truly missing, but in this case I am inclined to delete them. There might be circumstances in which these data points exist in an earlier version of the spreadsheet; in a perfect world I might endeavour to Frankenstein those relevant data points onto this particular dataset. But I did check the spreadsheet for version history, and it was a dead end.

Blank Spaces on Spreadsheet

Examining the data in its original xlsx format, I feel confident the NAs are genuinely missing values, about two weeks’ worth. The missing values can’t represent $0 transactions, because zeros already exist in the data and are attached to a specific ATM. The records with NA in both the ATM and Cash columns should be removed, so I will delete them.

Since we’re here, we should probably examine the data a little deeper for NAs or odd values. Checking the date column will require a separate sequence of all dates between the min and max dates of the dataset; I’ll use that list to check for missing days.

missing values
#-- Cash NA Count
atm_df %>% 
  group_by(cash) %>% 
  summarise(n= n()) %>% 
  arrange(desc(n))
# A tibble: 510 × 2
    cash     n
   <dbl> <int>
 1     0   364
 2    NA    19
 3    85    17
 4    90    16
 5    73    14
 6     9    13
 7    78    13
 8    96    13
 9     2    12
10    91    12
# ℹ 500 more rows
#-- ATM NA Count
atm_df %>% 
  group_by(atm) %>% 
  summarise(n= n()) %>% 
  arrange(desc(n))
# A tibble: 5 × 2
  atm       n
  <fct> <int>
1 ATM1    365
2 ATM2    365
3 ATM3    365
4 ATM4    365
5 <NA>     14
#-- Date NA Count
all_expected_dates<- seq(
  from= as.Date("2009-05-01"), to= as.Date("2010-05-14"), by="day")
setdiff(all_expected_dates, atm_df$date)

There are 19 NA cash values and 14 NA ATM values. Luckily there are no missing dates! We can take a closer look at the actual lines here:

na-sidebar
na_atm_df<- atm_df %>% 
  filter(is.na(atm)|is.na(cash))
na_atm_df
# A tibble: 19 × 3
   date       atm    cash
   <date>     <fct> <dbl>
 1 2009-06-13 ATM1     NA
 2 2009-06-16 ATM1     NA
 3 2009-06-18 ATM2     NA
 4 2009-06-22 ATM1     NA
 5 2009-06-24 ATM2     NA
 6 2010-05-01 <NA>     NA
 7 2010-05-02 <NA>     NA
 8 2010-05-03 <NA>     NA
 9 2010-05-04 <NA>     NA
10 2010-05-05 <NA>     NA
11 2010-05-06 <NA>     NA
12 2010-05-07 <NA>     NA
13 2010-05-08 <NA>     NA
14 2010-05-09 <NA>     NA
15 2010-05-10 <NA>     NA
16 2010-05-11 <NA>     NA
17 2010-05-12 <NA>     NA
18 2010-05-13 <NA>     NA
19 2010-05-14 <NA>     NA

At first, I debated removing all rows with NAs, but in truth it feels wrong. I can minimize row removal by using na.interp(), at least on the rows that are only missing the cash value.

Screenshot of NA

Removing these would leave ATM2 with 363 days of data and ATM1 with 362 days of data, so to maintain 365 records for all ATMs, I’ll estimate the missing points with na.interp().

remove na
atm_df<- atm_df %>% 
  filter(!is.na(atm))
na.interp()
atm_df<- atm_df %>%
  group_by(atm) %>%                              # interpolate within each ATM's own series
  mutate(cash= as.numeric(na.interp(cash))) %>% 
  ungroup()

atm_df<- as_tsibble(atm_df, key = atm, index = date)
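Now that the data is a tsibble, tsibble’s has_gaps() offers a quick confirmation that no dates are implicitly missing within any ATM (it should report .gaps = FALSE for all four); a small check:

#-- Confirm there are no implicit gaps per ATM
atm_df %>% has_gaps()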

Now that I believe the data is clean, I’m curious to see how the cash values are distributed, so I’ll visualize the spread.

distribution
atm_df%>% 
  ggplot(aes(cash))+
  geom_histogram(binwidth = 5, bins = 20,color= "black", fill= "gray")+
  facet_wrap(vars(atm), scales= "free", ncol = 1)+
  scale_x_continuous()+
  labs(title = "Distribution of Cash by ATM")+
  theme_bw()

Plotting raises ATM3 as a problem. There are a million reasons why its daily record would be mostly zero. Was the ATM in need of repair? If the data is said to be in “hundreds of dollars”, are the zeros total withdrawals that don’t reach $100? Or are these real zeros? Is the ATM placed in the middle of nowhere? Is it next to a bank? (I’d rather walk to my nearest bank to withdraw cash than pay a fee at a random ATM.)

The only values over zero for ATM3 fall on consecutive dates.
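Those few non-zero observations can be pulled out directly to confirm they sit together at the tail end of the series; a quick look:

#-- The handful of non-zero ATM3 observations
atm_df %>% 
  filter(atm == "ATM3", cash > 0)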

autoplot(atm_df, .vars= cash)+
  theme_bw()

ATM4 is also very skewed, so I will reduce the skew with a log transformation (log1p(), to handle the zeros).

log_atm<- atm_df %>% 
  mutate(cash_log= log1p(cash))

summary(log_atm)
      date              atm           cash            cash_log     
 Min.   :2009-05-01   ATM1:365   Min.   :    0.0   Min.   :0.0000  
 1st Qu.:2009-07-31   ATM2:365   1st Qu.:    1.0   1st Qu.:0.6931  
 Median :2009-10-30   ATM3:365   Median :   73.5   Median :4.3108  
 Mean   :2009-10-30   ATM4:365   Mean   :  155.3   Mean   :3.3989  
 3rd Qu.:2010-01-29              3rd Qu.:  114.0   3rd Qu.:4.7449  
 Max.   :2010-04-30              Max.   :10919.8   Max.   :9.2984  
autoplot(log_atm, .vars = cash_log)+
  theme_bw()

The resulting plot looks noisy to the eye. Separating the series would be best before I make any more decisions.

log_atm %>% 
  autoplot(cash_log)+
  facet_wrap(~atm, scales = "free_y", ncol = 1)

ATM1 and ATM2 have a repeating pattern with some slight variation. ATM3 is the problem child, and ATM4 is relatively stable except for that outlier, which is affecting the visual even after the log. I think the pattern is weekly, but to confirm visually I added a weekly scale to the plot. The tick marks match the right points of the lines, but I wondered how I could make this more visually appealing.

I looked up what day of the week our minimum value falls on: a Friday. From there, I added 6 days to that, so the seventh day would be a blank space on the overlay and the beginning of the next sequence of days. I ended up confusing myself, really, because while many dips did seem to fall right into the blank space, not all of them did.

weekly bounds
weekly_bounds<- tibble(weekstart= seq(from= as.Date("2009-05-01"), to= as.Date("2010-04-30"), by= "week"))
weekly_bounds<- weekly_bounds %>% 
  mutate(weekend= weekstart+ 6)
                       
atm_df %>%
   ggplot(aes(x = date, y = cash))+
   geom_rect(data = weekly_bounds, 
             aes(xmin = weekstart, xmax = weekend, ymin = -Inf, 
                 ymax = Inf), inherit.aes = FALSE, 
             fill = "gray", alpha = 0.35, color = NA)+
  geom_line()+
  facet_wrap(~ atm, scales = "free_y", ncol = 1) +
  scale_x_date(date_labels = "%Y-%m-%d", date_breaks = "2 months")+
  theme_bw()
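As an aside, feasts’ gg_season() with period = "week" is another way to confirm the weekly pattern without manual shading; a minimal sketch for the two well-behaved ATMs:

#-- Seasonal plot over a weekly period for ATM1 and ATM2
atm_df %>% 
  filter(atm %in% c("ATM1", "ATM2")) %>% 
  gg_season(cash, period = "week") +
  theme_bw()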

Nevertheless, I think it’s safe to say these series are not simply linear. I will approach ATM1 and ATM2 with seasonal models. First, I’ll compute the Guerrero lambdas before modelling.

lambdas
atm_df %>%
  filter(atm== "ATM1") %>% 
  features(.var = cash, features = "guerrero") %>% 
  pull(lambda_guerrero)
[1] 0.2377346
atm_df %>%
  filter(atm== "ATM2") %>% 
  features(.var = cash, features = "guerrero") %>% 
  pull(lambda_guerrero)
[1] 0.7178414
atm_df %>%
  filter(atm== "ATM3") %>% 
  features(.var = cash, features = "guerrero") %>% 
  pull(lambda_guerrero)
[1] 1.365421
atm_df %>%
  filter(atm== "ATM4") %>% 
  features(.var = cash, features = "guerrero") %>% 
  pull(lambda_guerrero)
[1] -0.0737252
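For reference, the same four lambdas can be computed in a single call, since features() works across the tsibble keys:

#-- Guerrero lambdas for all four ATMs at once
atm_df %>% 
  features(cash, features = guerrero)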

The Forecast:

From the lambda values produced, it seems ATM1 should be transformed (lambda ≈ 0.24), ATM2 is close to fine as it is (lambda ≈ 0.72), ATM3 shouldn’t be transformed (lambda ≈ 1.37), and ATM4, with a lambda near zero, calls for a log transformation.

atm forecasts
lam1= 0.2377346

ATM1<- atm_df %>% 
  filter(atm== "ATM1") %>% 
  mutate(logged= log1p(cash))

#-- Was having errors so I'm formatting as tsibble here
ATM1<- as_tsibble(ATM1)
model1<- ATM1 %>% 
  model(SNAIVE= SNAIVE(logged ~ lag("week")),
        BoxCox= ETS(box_cox(logged, lam1)),
        ARIMA_Box= ARIMA(box_cox(logged, lam1)))

f.model1<- model1 %>% forecast(h= "30 days")

autoplot(ATM1, .vars = logged)+
  autolayer(f.model1, level = NULL)+
  theme_bw()+
  ggtitle("ATM1 Cash Withdrawal Forecast", subtitle = "May 2010")

Using augment(), the first week of SNAIVE fitted values is NA as expected, and the .resid and .innov columns are identical.

#-- Residual Diagnostics SNAIVE
mod1test<- ATM1 %>% 
  model(SNAIVE= SNAIVE(logged ~ lag("week")))

mod1test %>% augment()
# A tsibble: 365 x 7 [1D]
# Key:       atm, .model [1]
   atm   .model date       logged .fitted  .resid  .innov
   <fct> <chr>  <date>      <dbl>   <dbl>   <dbl>   <dbl>
 1 ATM1  SNAIVE 2009-05-01   4.57   NA    NA      NA     
 2 ATM1  SNAIVE 2009-05-02   4.42   NA    NA      NA     
 3 ATM1  SNAIVE 2009-05-03   4.45   NA    NA      NA     
 4 ATM1  SNAIVE 2009-05-04   4.51   NA    NA      NA     
 5 ATM1  SNAIVE 2009-05-05   4.61   NA    NA      NA     
 6 ATM1  SNAIVE 2009-05-06   4.49   NA    NA      NA     
 7 ATM1  SNAIVE 2009-05-07   2.20   NA    NA      NA     
 8 ATM1  SNAIVE 2009-05-08   4.65    4.57  0.0792  0.0792
 9 ATM1  SNAIVE 2009-05-09   4.48    4.42  0.0585  0.0585
10 ATM1  SNAIVE 2009-05-10   4.54    4.45  0.0889  0.0889
# ℹ 355 more rows
#-- Residual Diagnostics BOXANA
mod1test2<- ATM1 %>% 
  model(BoxCox= ETS(cash ~ error("A")+
              trend("N")+
              season("A")))

mod1test2 %>% augment()
# A tsibble: 365 x 7 [1D]
# Key:       atm, .model [1]
   atm   .model date        cash .fitted .resid .innov
   <fct> <chr>  <date>     <dbl>   <dbl>  <dbl>  <dbl>
 1 ATM1  BoxCox 2009-05-01    96    91.4   4.61   4.61
 2 ATM1  BoxCox 2009-05-02    82    83.2  -1.18  -1.18
 3 ATM1  BoxCox 2009-05-03    85    90.4  -5.40  -5.40
 4 ATM1  BoxCox 2009-05-04    90    85.9   4.14   4.14
 5 ATM1  BoxCox 2009-05-05    99    88.3  10.7   10.7 
 6 ATM1  BoxCox 2009-05-06    88    80.1   7.86   7.86
 7 ATM1  BoxCox 2009-05-07     8    17.9  -9.87  -9.87
 8 ATM1  BoxCox 2009-05-08   104    93.1  10.9   10.9 
 9 ATM1  BoxCox 2009-05-09    87    83.1   3.94   3.94
10 ATM1  BoxCox 2009-05-10    93    89.0   4.02   4.02
# ℹ 355 more rows
#-- Residual Diagnostics BOXARIMA
mod1test3<- ATM1 %>% 
  model(ARIMA(box_cox(logged, lam1)))

mod1test3 %>% augment()
# A tsibble: 365 x 7 [1D]
# Key:       atm, .model [1]
   atm   .model                       date       logged .fitted  .resid   .innov
   <fct> <chr>                        <date>      <dbl>   <dbl>   <dbl>    <dbl>
 1 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-01   4.57    4.57 0.00583 0.00183 
 2 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-02   4.42    4.41 0.00553 0.00178 
 3 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-03   4.45    4.45 0.00560 0.00179 
 4 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-04   4.51    4.51 0.00571 0.00181 
 5 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-05   4.61    4.60 0.00589 0.00184 
 6 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-06   4.49    4.48 0.00567 0.00180 
 7 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-07   2.20    2.20 0.00158 0.000866
 8 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-08   4.65    4.59 0.0653  0.0203  
 9 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-09   4.48    4.43 0.0482  0.0154  
10 ATM1  ARIMA(box_cox(logged, lam1)) 2009-05-10   4.54    4.47 0.0734  0.0233  
# ℹ 355 more rows
lam2= 0.7178414

ATM2<- atm_df %>% 
  filter(atm== "ATM2")
  
ATM2<- ATM2 %>% as_tsibble(index = date)
model2<- ATM2 %>% 
  model(SNAIVE= SNAIVE(cash ~ lag("week")),
        BoxCox= ETS(cash ~ error("A")+
              trend("N")+
              season("A")))

f.model2<- model2 %>% forecast(h= "30 days")

autoplot(ATM2, .vars = cash)+
  autolayer(f.model2, level = NULL)+
  theme_bw()+
  ggtitle("ATM2 Cash Withdrawal Forecast", subtitle = "May 2010")

SNAIVE seems to be a good predictor for ATM1 and ATM2.

ATM1 %>% 
  ACF(logged) %>% 
  autoplot()

ATM2 %>% 
  ACF(cash) %>% 
  autoplot()

I think SNAIVE did a good job of forecasting ATM1 and ATM2. Now on to the harder parts!
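To back that impression with numbers, the training-set accuracy of the candidate models can be compared with fabletools’ accuracy(); a minimal check (keeping in mind that model1 was fit on the logged series, so its errors are on the log scale):

#-- In-sample accuracy of the candidate models for ATM1 and ATM2
accuracy(model1)
accuracy(model2)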

ATM3<- atm_df %>% 
  filter(atm== "ATM3")

ATM3<- ATM3 %>% as_tsibble(index = date)

ATM3 %>% 
  ACF(cash) %>% 
  autoplot()

ts.ATM3<- ts(ATM3$cash, frequency = 7)
auto.arima(ts.ATM3, ic= "aic")
Series: ts.ATM3 
ARIMA(0,0,2) with zero mean 

Coefficients:
         ma1     ma2
      0.8392  0.8557
s.e.  0.0496  0.0611

sigma^2 = 25.4:  log likelihood = -1108.69
AIC=2223.39   AICc=2223.46   BIC=2235.09

The pdq provided is zero for \(p\) (autoregressive), zero for \(d\) (differencing), and 2 for \(q\) (moving average).2

I also decided to add pdq(1,0,0) because Hyndman mentions it will show a single significant spike, which this data does have. He technically specifies it at lag 12, but I’m curious anyway.

model3<- ATM3 %>% 
  model(autoARIMApdq= ARIMA(cash ~ pdq(0,0,2)),
        ARIMAhyndmanpdq= ARIMA(cash ~ pdq(1,0,0)),
        ExpSm= ETS(cash),
        GenARIMA= ARIMA(cash))

f.model3<- model3 %>% forecast(h= "30 days")

autoplot(ATM3, .vars = cash)+
  autolayer(f.model3, level = NULL)+
  theme_bw()+
  ggtitle("ATM3 Cash Withdrawal Forecast", subtitle = "May 2010")

That looks better than I thought it would. My knee-jerk reaction would be to tell the stakeholders there is not enough information on ATM3 to include it in forecasting. The forecast plot intuitively looks reasonable to me, in that the spike comes to a sharp decline, but if it’s presented to stakeholders it would need a large disclaimer.

model3 %>% 
  select(autoARIMApdq) %>% 
  gg_tsresiduals()

model3 %>% 
  select(ARIMAhyndmanpdq) %>% 
  gg_tsresiduals()

The residuals plot just doesn’t give me much confidence. The non-zero spike accounts for only a handful of days, well under 1% of the data, and there is no ACF significance. It simply doesn’t make sense to use the outliers for forecasting here. Removing those outliers, the data simply shows $0. I think this data is a dud, but I would be interested in seeing what other colleagues make of it.
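To quantify how little usable signal there is, the share of zero days can be computed directly; a quick supporting check (the column names here are just illustrative):

#-- Count and share of zero-cash days for ATM3
ATM3 %>% 
  as_tibble() %>% 
  summarise(zero_days = sum(cash == 0),
            share_zero = mean(cash == 0))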

On to ATM4. . .

ATM4<- atm_df %>% 
  filter(atm== "ATM4")

ATM4<- ATM4 %>% as_tsibble(index = date)


model4<- ATM4 %>% 
  model(SNAIVE= SNAIVE(cash ~ lag("week")),
        ANA= ETS(cash ~ error("A")+
              trend("N")+
              season("A")),
        GenARIMA= ARIMA(cash))

f.model4<- model4 %>% forecast(h= "30 days")

autoplot(ATM4, .vars = cash)+
  autolayer(f.model4, level = NULL)+
  theme_bw()+
  ggtitle("ATM4 Cash Withdrawal Forecast", subtitle = "May 2010")

model4 %>% 
  select(SNAIVE) %>% 
  gg_tsresiduals()

model4 %>% 
  select(ANA) %>% 
  gg_tsresiduals()

model4 %>% 
  select(GenARIMA) %>% 
  gg_tsresiduals()

My untrained eye says ARIMA is the better-fitting model, with ANA in second place, but the spike is the problem. Still, I’m not inclined to remove the outlier from ATM4. In looking at the autocorrelation plot, I learned the ATM4 time series is essentially white noise. Hyndman states the ACF spikes will stay within the blue bands if we have a white noise series.3

None of the ACF spikes cross the blue bounds, not even with the outlier included. I think I’ll stick with the ARIMA forecast.
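That white-noise impression can be checked a bit more formally with a Ljung-Box test on the series, where a large p-value is consistent with white noise; a minimal sketch using feasts (lag = 14, i.e. two weeks, is an arbitrary choice here):

#-- Ljung-Box test on the ATM4 series
ATM4 %>% 
  features(cash, ljung_box, lag = 14)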

model4<- ATM4 %>% 
  model(GenARIMA= ARIMA(cash))

f.model4<- model4 %>% forecast(h= "30 days")

autoplot(ATM4, .vars = cash)+
  autolayer(f.model4, level = NULL)+
  theme_bw()+
  ggtitle("ATM4 Cash Withdrawal Forecast Using ARIMA", subtitle = "May 2010")

#-- Reuse the 30-day forecasts generated above
ATM_forecast1<- as_tibble(f.model1)
ATM_forecast2<- as_tibble(f.model2)
ATM_forecast3<- as_tibble(f.model3)
ATM_forecast4<- as_tibble(f.model4)

readr::write_excel_csv(ATM_forecast1, "ATM_forecast1.csv")
readr::write_excel_csv(ATM_forecast2, "ATM_forecast2.csv")
readr::write_excel_csv(ATM_forecast3, "ATM_forecast3.csv")
readr::write_excel_csv(ATM_forecast4, "ATM_forecast4.csv")

Click here for ATM Excel Files


Part B: Forecasting Power

Data: ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward.

The Setup:

residential<- readxl::read_xlsx("residential.xlsx")
residential<- janitor::clean_names(residential)
residential<- residential %>%
  rename(date= yyyy_mmm)

residential$date<- yearmonth(residential$date)
na values
residential %>%
  filter(is.na(date))
# A tibble: 0 × 3
# ℹ 3 variables: case_sequence <dbl>, date <mth>, kwh <dbl>
residential<- as_tsibble(residential, index= date)

Since I’m working with electricity data, I have some understandable assumptions about seasonality. But I will keep in mind that not all places have much seasonal change, and the geographical source of the data is not disclosed, so I’ll approach the data a little more carefully.

plot
autoplot(residential, .vars = kwh)

The autoplot shows the data staying (for the most part) within a horizontal band, with roughly annual peaks. The end of the series seems to drift slightly upward, but I’m not sure about the stability or significance of that.

Out of curiosity, I’ll compute the spectral entropy before I go on with forecasting. Values closer to 0 suggest a strong trend/seasonal signal that should be easier to forecast, while values closer to 1 indicate a noisier series.

feat_spectral(residential$kwh)
spectral_entropy 
       0.6278332 

I’ve just realized there is a missing data point; scrolling back up, I see I only checked the date column for NAs, not kwh. So, I’ll fix that now.

ts-residential
residential %>%
  filter(is.na(kwh))
# A tsibble: 1 x 3 [1M]
  case_sequence     date   kwh
          <dbl>    <mth> <dbl>
1           861 2008 Sep    NA
ts.residential<- ts(data = residential$kwh, start = 1998, frequency = 12)

ts.residential<- ts.residential %>% 
  na.interp()

ggseasonplot(ts.residential, year.labels = TRUE)+
  labs(title = "Residential Power Usage 1998-2013", caption = "Seasonal Plot")

residential %>%
  ggplot(aes(month(date, label= TRUE, abbr = TRUE), kwh,
             group = year(date), color= year(date)))+ 
  geom_line(size= 1.0, alpha= 0.8)+
  scale_color_viridis_b()+ 
  labs(title = "Residential Power Usage 1998-2013")+
  theme_bw()+
  theme(axis.title.x.bottom = element_blank(),
        legend.title = element_blank())

The Forecast:

Before I attempt to forecast, I will use the KPSS test to see whether the data is stationary,4 i.e. whether differencing is needed.

ur.kpss()
ts.residential %>% 
  ur.kpss() %>% 
  summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 4 lags. 

Value of test-statistic is: 1.298 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

The test statistic 1.298 is higher than the 1% critical value of 0.739, so the data is not stationary. The ACF/PACF plots support this: the spikes go well beyond the blue bounds, with a pattern of peaks at lags 12, 24, and 36. So I’ll difference the data and check again.

ggtsdisplay(ts.residential)

ts.residential %>% 
  diff() %>% 
  ur.kpss() %>% 
  summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 4 lags. 

Value of test-statistic is: 0.0428 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

The differenced series passes the KPSS test (the statistic 0.0428 is below all the critical values), and ndiffs() agrees that a single difference is enough, so I can move on to forecasting.

ndiffs(ts.residential)
[1] 1
model
lam<- BoxCox.lambda(ts.residential, method= "guerrero")
lam
[1] 0.1193072
f.ts.residential<- 
  auto.arima(ts.residential,lambda = lam, seasonal = TRUE,
             approximation = FALSE)

h=12
modelB<- forecast(f.ts.residential, h = h, lambda = lam)
autoplot(modelB)+
  ggtitle("Residential Power Usage Forecast Using ARIMA",
          subtitle = "for 2014")+
  ylab("kWh")

residential_forecast<- as_tibble(as.data.frame(modelB), rownames = "date")
readr::write_excel_csv(residential_forecast, "residential_forecast.csv")

Click here for Residential Excel Forecast


Part C: Bonus

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows).

Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd file, along with the forecast in an Excel-readable file.

The Setup:

waterflow1<- readxl::read_xlsx("Waterflow_Pipe1.xlsx")
waterflow2<- readxl::read_xlsx("Waterflow_Pipe2.xlsx")

waterflow1<- janitor::clean_names(waterflow1)
waterflow2<- janitor::clean_names(waterflow2)
full_join()
waterflow<- full_join(x = waterflow1, y = waterflow2, by = "date_time")
rm(waterflow1, waterflow2)

waterflow<- waterflow %>% 
  rename(water_flow= water_flow.x) %>% 
  mutate(water_flow= coalesce(water_flow, water_flow.y)) %>% 
  select(date_time, water_flow)

After hours of trying to make the date format work, I found a StackOverflow answer that gave me the missing piece: multiplying by 86400, the number of seconds in a day. I created a placeholder column so I could convert the original column to POSIXct, extract what I needed into separate columns, and then merge them back together. I don’t know why the regular datetime functions didn’t work for me, but this is my workaround.

time conversion- the struggle
waterflow <- waterflow %>%
  mutate(dt_placeholder=
           as.POSIXct(date_time * 86400, origin = "1899-12-30"),  # Excel serial day -> POSIXct
         date = date(dt_placeholder),
         hour = format(dt_placeholder, "%H")) %>%
  group_by(date, hour) %>% 
  mutate(hourly_mean= mean(water_flow))   # mean of all readings within each hour
final<- waterflow %>% 
  select(date, hour, hourly_mean)

final<- final %>% 
  distinct(date, hour, hourly_mean)

rm(waterflow)


final<- final %>%
  mutate(hour= as.numeric(hour))

final <- final %>%
  mutate(datetime = ymd_h(paste(date, hour)))


ftsibble<- final %>% 
  as_tsibble(index = datetime)
ftsibble %>% 
  ggplot(aes(datetime, hourly_mean))+
  geom_line()+
  theme_bw()+
  ggtitle("Averge Hourly Waterflow")

The Forecast:

waterflow<- ftsibble %>% 
  pull(hourly_mean)

waterflow %>% 
  ur.kpss() %>% 
  summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 6 lags. 

Value of test-statistic is: 4.9607 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739
waterflow %>% 
  diff() %>% 
  ur.kpss() %>% 
  summary

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 6 lags. 

Value of test-statistic is: 0.0111 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739
lam<- BoxCox.lambda(waterflow,method= "guerrero")
lam
[1] 0.3784534
ggtsdisplay(waterflow)

forecast
water_ts<- ts(waterflow, frequency = 24)

f.water_ts<- 
  auto.arima(water_ts,lambda = lam, seasonal = TRUE)

h= 7*24   # a week of hourly forecasts, per the Part C requirement

model<- forecast(f.water_ts, h = h, lambda = lam)
autoplot(model, fcol = "pink")+
  theme_bw()+
  ggtitle("Averge Hourly Waterflow Forecast")

waterflow_forecast<- as_tibble(as.data.frame(model), rownames = "period")
readr::write_excel_csv(waterflow_forecast, "waterflow_forecast.csv")

Click here to download the forecast excel spreadsheet


Bibliography

Why 30th December 1899. https://www.iwpcug.org/docs/18991230.php. Accessed 22 Oct. 2025.

Tierney, Nicholas, and Allison Horst. The Missing Book. 7 Apr. 2022, https://tmb.njtierney.com/.

Marek, Reto. 11.1 Overlapping | Energy Data Analysis with R. hslu-ige-laes.github.io, https://hslu-ige-laes.github.io/edar/overlapping.html. Accessed 20 Oct. 2025.

Footnotes

  1. Why 30th December 1899. https://www.iwpcug.org/docs/18991230.php. Accessed 23 Oct. 2025.↩︎

  2. 9.9 Seasonal ARIMA Models: “An ARIMA(0,0,0)(1,0,0). . .”↩︎

  3. 2.9: White Noise https://otexts.com/fpp3/wn.html↩︎

  4. 8.1 Stationarity and differencing: https://otexts.com/fpp2/stationarity.html↩︎