Questions

Question 1

I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.

Introduction

library(tidyverse)
library(forecast)

atm <- readxl::read_excel('./ATM624Data.xlsx') %>%
  mutate(DATE = as.Date(DATE, origin='1899-12-30'))

atm %>%
  summary()
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

There are 19 missing Cash values. Examining them reveals two cases: 5 rows where only the Cash value is missing and 14 rows that are completely empty. The 14 empty rows can be removed as they are extraneous, but the 5 remaining cases will need to be imputed.

atm %>%
  filter(!complete.cases(.)) %>%
  DT::datatable()
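
A quick tally (an added sketch of the same check) confirms the split between the two cases; rows with a missing ATM label are the completely empty ones.

atm %>%
  filter(is.na(Cash)) %>%
  count(ATM)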

I decided to impute the 5 missing values using the median of the corresponding ATM’s remaining values. The exploration below revealed that each ATM has significantly different values, so I felt it was better to impute from only the corresponding ATM’s observations.

atm <- atm %>%
  filter(!(is.na(ATM) & is.na(Cash)))

medians <- atm %>%
  filter(!is.na(Cash)) %>%
  group_by(ATM) %>%
  summarise(med = median(Cash))

atm[is.na(atm$Cash) & atm$ATM == 'ATM1', ]$Cash <- medians$med[medians$ATM == 'ATM1'][1] 
atm[is.na(atm$Cash) & atm$ATM == 'ATM2', ]$Cash <- medians$med[medians$ATM == 'ATM2'][1]

Examining the largest values reveals 1 value that is an order of magnitude larger than any other value in the data set. This value is either a mistake (perhaps a typo) or it is a valid value but wholly unrepresentative of typical ATM usage. Perhaps an extraordinary event occurred that day. This value can be adjusted by setting it to the next largest value from the same ATM. This will aid modeling by not distracting the algorithm with an obvious outlier.

atm %>%
  arrange(desc(Cash)) %>%
  head(10)
## # A tibble: 10 x 3
##    DATE       ATM     Cash
##    <date>     <chr>  <dbl>
##  1 2010-02-09 ATM4  10920.
##  2 2009-09-22 ATM4   1712.
##  3 2010-01-29 ATM4   1575.
##  4 2009-09-04 ATM4   1495.
##  5 2009-06-09 ATM4   1484.
##  6 2009-09-05 ATM4   1301.
##  7 2010-02-28 ATM4   1276.
##  8 2009-08-23 ATM4   1246.
##  9 2009-06-14 ATM4   1221.
## 10 2009-09-29 ATM4   1195.
# Replace the single extreme value with the next-largest Cash value
# (which, like the outlier itself, belongs to ATM4).
n <- atm$Cash %>%
  na.omit() %>%
  length()

temp <- atm$Cash %>%
  na.omit() %>%
  sort(partial=c(n-1, n))

largest.2 <- temp[n-1]  # second-largest value
largest <- temp[n]      # the outlier

atm <- atm %>%
  mutate(Cash = ifelse(Cash == largest, largest.2, Cash))

The data is now cleaned and ready for analysis and modelling.

atm %>%
  summary()
##       DATE                ATM                 Cash     
##  Min.   :2009-05-01   Length:1460        Min.   :   0  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:   1  
##  Median :2009-10-30   Mode  :character   Median :  73  
##  Mean   :2009-10-30                      Mean   : 149  
##  3rd Qu.:2010-01-29                      3rd Qu.: 114  
##  Max.   :2010-04-30                      Max.   :1712

The plot below shows that each of the ATMs has a very different distribution. ATM1 appears to have cyclic withdrawals that are not being captured by a simple linear regression. ATM2 and ATM4 appear to have more traditionally consistent withdrawal values. Finally, ATM3 appears to not have been in use prior to late April.

Considering how different the distributions of the 4 ATMs are, I will separate the data and create a separate model for each one.

atm %>%
  ggplot(aes(DATE, Cash, color=ATM)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ATM, scales='free') +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x="", y="")

atm.1 <- atm %>%
  filter(ATM == 'ATM1')
atm.2 <- atm %>%
  filter(ATM == 'ATM2')
atm.3 <- atm %>%
  filter(ATM == 'ATM3')
atm.4 <- atm %>%
  filter(ATM == 'ATM4')

ATM1

As noted previously, there appears to be some cyclicity to the cash withdrawals at ATM 1. The pattern becomes clear when faceting the data by day of the week. Sunday, Monday, Wednesday, Friday and Saturday appear incredibly consistent across the year. Tuesday and Thursday, on the other hand, are a bit more difficult to unravel.

For most of the year Thursday’s withdrawal rates were very low, up until they joined the other days of the week around March. At the exact same time, Tuesday’s withdrawal rates plummeted to about the levels previously seen on Thursday. I cannot speculate why this is the case. However, the trend has continued long enough that I believe it will remain this way through at least the end of the predictive period.

atm.1 <- atm.1 %>%
  mutate(day_of_week = factor(format(DATE, "%a"), levels=c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'))) 
atm.1 %>%
  ggplot(aes(DATE, Cash)) +
  geom_point() +
  facet_wrap(~day_of_week) + 
  theme_bw() +
  labs(x="", y="")

Using the entirety of the data set to make predictions seems unwise in this case. Not only does the older data appear unrepresentative of the current trends, but it will also be difficult for any decomposition function to discern this change in seasonality. For this reason, I am going to subset the data and find a trend based only on the more recent data. This trend began the week of 02/21/2010.

The risk of doing this is that there may be a yearly seasonality in which March and April withdrawal rates change and then revert. There simply isn’t enough data to draw any insight. With that being said, such a sizeable change over such a short period of time, absent any known underlying reason, would indicate to me that this is a relatively permanent change.

A trend is much easier to identify in this subset of the data.

subset.atm.1 <- atm.1 %>%
  filter(DATE >= as.POSIXct('2010-02-21'))

subset.atm.1 %>%
  ggplot(aes(DATE, Cash)) +
  geom_point(aes(color=day_of_week)) +
  geom_line(aes(alpha=0.2), show.legend=FALSE) +
  labs(x="", y="", color='') +
  scale_color_brewer(palette = 'Dark2') +
  scale_y_continuous(limit=c(0, 180)) +
  theme_bw() +
  theme(legend.background = element_rect(fill='white', color='black'),
        legend.key = element_rect(fill='grey80'),
        legend.position = c(0.7, 0.88),
        legend.direction = 'horizontal')

forecast.atm.1 <- subset.atm.1$Cash %>%
  ts(frequency=7, start=1) %>%
  hw(h=31, damped=TRUE, level=c(80))

I fit a damped Holt-Winters model with weekly seasonality (frequency 7) to this subset and forecast 31 days ahead to cover May. The predicted May values are:

forecast.atm.1$mean
## Time Series:
## Start = c(10, 7) 
## End = c(15, 2) 
## Frequency = 7 
##  [1]  84.1601124 106.0858919  66.4953373   0.8646835  98.8164126
##  [6]  83.1075115  84.6762672  83.9575492 105.8928075  66.3112881
## [11]   0.6892467  98.6491853  82.9481095  84.5243242  83.8127163
## [16] 105.7547520  66.1796928   0.5638094  98.5296177  82.8341370
## [21]  84.4156849  83.7091608 105.6560422  66.0856021   0.4741215
## [26]  98.4441267  82.7526465  84.3380078  83.6351184 105.5854646
## [31]  66.0183271

The plotted predictions appear reasonable:

forecast.atm.1 %>%
  autoplot() +
  theme_bw() +
  labs(x='Time Series', y='Cash', title='') +
  theme(legend.position='none',
        panel.border = element_blank())
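
As an additional diagnostic (a sketch, not part of the original write-up), the residuals of the fitted Holt-Winters model can be checked for remaining autocorrelation with checkresiduals(), which plots the residuals and runs a Ljung-Box test.

forecast.atm.1 %>%
  checkresiduals()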

ATM2

Similar to ATM 1, the distribution of values withdrawn from the ATM appears to suddenly change in the past two months. Specifically, Monday, Tuesday, Wednesday and Thursday see fundamental shifts in their values. Sunday and Friday also see changes, although to a lesser extent. In order to capture this more recent trend, I will once again subset the more recent data.

atm.2 <- atm.2 %>%
  mutate(day_of_week = factor(format(DATE, "%a"), levels=c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')))

atm.2 %>%
  ggplot(aes(DATE, Cash)) +
  geom_point() +
  facet_wrap(~day_of_week) +
  theme_bw() +
  labs(x="", y="")

Again, a trend is much easier to identify in this more recent subset.

subset.atm.2 <- atm.2 %>%
  filter(DATE >= as.POSIXct('2010-02-21'))

subset.atm.2 %>%
  ggplot(aes(DATE, Cash)) +
  geom_point(aes(color=day_of_week)) +
  geom_line(aes(alpha=0.2), show.legend=FALSE) +
  labs(x="", y="", color='') +
  scale_color_brewer(palette = 'Dark2') +
  scale_y_continuous(limit=c(0, 150)) +
  theme_bw() +
  theme(legend.background = element_rect(fill='white', color='black'),
        legend.key = element_rect(fill='grey80'),
        legend.position = c(0.7, 0.88),
        legend.direction = 'horizontal')

forecast.atm.2 <- subset.atm.2$Cash %>%
  ts(frequency=7, start=1) %>%
  hw(h=31, damped=TRUE, level=c(80))

As with ATM 1, I fit a damped Holt-Winters model with weekly seasonality and forecast 31 days ahead. The predicted May values are:

forecast.atm.2$mean
## Time Series:
## Start = c(10, 7) 
## End = c(15, 2) 
## Frequency = 7 
##  [1]  69.337179  82.683535  14.448620   5.081073 105.458137  94.948565
##  [7]  53.345016  69.820002  83.155757  14.910473   5.532785 105.899931
## [13]  95.380659  53.767623  70.233329  83.560009  15.305850   5.919481
## [19] 106.278136  95.750559  54.129401  70.587164  83.906075  15.644318
## [25]   6.250517 106.601903  96.067218  54.439107  70.890070  84.202331
## [31]  15.934068

The plotted predictions appear reasonable:

forecast.atm.2 %>%
  autoplot() +
  theme_bw() +
  labs(x='Time Series', y='Cash', title='') +
  theme(legend.position='none',
        panel.border = element_blank())

ATM3

ATM 3 is simultaneously the easiest and most challenging prediction to make. Examining the plot, there are only three usable values. It is unclear why all the earlier values are 0, but it is safe to assume that the ATM simply wasn’t open and in use during that time.

atm.3 %>%
  ggplot(aes(DATE, Cash)) +
  geom_point() +
  theme_bw() +
  labs(x='', y='')

Three values is simply not enough to make any reasonable prediction. There are really only two choices that can be made at this stage. The first would be to inform my boss that the lack of data means that results for this ATM are essentially unreliable and that I would not feel comfortable making decisions based on any model developed based on so few points. Of course, this may not be an available option.

Thus, the second choice would be to find the ATM that is most similar to this one and use those values as my predictions. Given a more robust data set, we could select ATMs that are similar based on previously established relationships. With this data set however, we have limited information to go on.

Comparing the only three values in ATM 3 to the last three values of the other ATMs, it is clear that ATM 1 is the most similar to ATM 3. In fact, the values are identical!

atm %>%
  filter(DATE >= as.POSIXct('2010-04-28')) %>%
  spread(key='ATM', value='Cash') %>%
  DT::datatable()
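
A quick numeric comparison (an illustrative sketch) makes the same point: summing the absolute differences between each ATM’s last three values and ATM 3’s values should return 0 for ATM 1 if the values are truly identical.

atm %>%
  filter(DATE >= as.POSIXct('2010-04-28')) %>%
  spread(key='ATM', value='Cash') %>%
  summarise(ATM1 = sum(abs(ATM1 - ATM3)),
            ATM2 = sum(abs(ATM2 - ATM3)),
            ATM4 = sum(abs(ATM4 - ATM3)))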

Thus, I will use ATM 1 to model ATM 3 and reuse the predictions made for ATM 1 for ATM 3. It should be noted that ATM 2 is also incredibly similar value-wise to ATM 3, and an argument could be made for using the median of ATM 1 and ATM 2’s values. However, with as little data as we have for ATM 3, I believe the decision to be arbitrary.

I believe with just a bit more data we can determine which model would be best for ATM 3 and thus would suggest revisiting the ATM in a week or two.

forecast.atm.3 <- forecast.atm.1

forecast.atm.3$mean
## Time Series:
## Start = c(10, 7) 
## End = c(15, 2) 
## Frequency = 7 
##  [1]  84.1601124 106.0858919  66.4953373   0.8646835  98.8164126
##  [6]  83.1075115  84.6762672  83.9575492 105.8928075  66.3112881
## [11]   0.6892467  98.6491853  82.9481095  84.5243242  83.8127163
## [16] 105.7547520  66.1796928   0.5638094  98.5296177  82.8341370
## [21]  84.4156849  83.7091608 105.6560422  66.0856021   0.4741215
## [26]  98.4441267  82.7526465  84.3380078  83.6351184 105.5854646
## [31]  66.0183271

ATM4

Once again, there is a fundamental change in the ATM’s usage beginning in the past two months. As with ATM 1 and ATM 2, I will subset the data in order to better identify the recent trend.

atm.4 <- atm.4 %>%
  mutate(day_of_week = factor(format(DATE, "%a"), levels=c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')))

atm.4 %>%
  ggplot(aes(DATE, Cash)) +
  geom_point() +
  facet_wrap(~day_of_week) +
  theme_bw() +
  labs(x="", y="")

This data is easier to interpret; however, it also appears to be heteroscedastic. That is, the variation in the earlier data is larger than that of the later data. This can cause difficulty for the decomposition. I will attempt a Box-Cox transformation to address this issue.

subset.atm.4 <- atm.4 %>%
  filter(DATE >= as.POSIXct('2010-02-21'))

subset.atm.4 %>%
  ggplot(aes(DATE, Cash)) +
  geom_point(aes(color=day_of_week)) +
  geom_line(aes(alpha=0.2), show.legend=FALSE) +
  labs(x="", y="", color='') +
  scale_color_brewer(palette = 'Dark2') +
  scale_y_continuous(limit=c(0, 1300)) +
  theme_bw() +
  theme(legend.background = element_rect(fill='white', color='black'),
        legend.key = element_rect(fill='grey80'),
        legend.position = c(0.7, 0.88),
        legend.direction = 'horizontal')

A Box-Cox transformation with lambda = 0.5 is equivalent (up to a linear rescaling) to taking the square root of the variable, so I will apply the square root manually. This transformation appears to have addressed the heteroscedasticity of the data.

BoxCox.lambda(subset.atm.4$Cash)
## [1] 0.5064559
subset.atm.4 <- subset.atm.4 %>%
  mutate(sCash = sqrt(Cash)) 
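
The equivalence is easy to verify numerically (an illustrative check): BoxCox() with lambda = 0.5 computes (x^0.5 - 1) / 0.5 = 2 * (sqrt(x) - 1), a simple linear rescaling of the square root, so sqrt() provides the same variance-stabilizing effect.

# Verify that the lambda = 0.5 Box-Cox transform is a linear rescaling of sqrt(x).
x <- c(100, 400, 900)
all.equal(as.numeric(BoxCox(x, lambda=0.5)), 2 * (sqrt(x) - 1))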

subset.atm.4 %>%
  ggplot(aes(DATE, sCash)) +
  geom_point(aes(color=day_of_week)) +
  geom_line(aes(alpha=0.2), show.legend=FALSE) +
  labs(x="", y="", color='') +
  scale_color_brewer(palette = 'Dark2') +
  scale_y_continuous(limit=c(0, 40)) +
  theme_bw() +
  theme(legend.background = element_rect(fill='white', color='black'),
        legend.key = element_rect(fill='grey80'),
        legend.position = c(0.7, 0.88),
        legend.direction = 'horizontal')

Making predictions for this dataset proved to be quite difficult. Exploring the data I discovered that it is stationary. Some exploratory plots hinted at a weekly autocorrelation, but findfrequency indicated a seasonal period of 2 (i.e. 48 hours). However, this frequency only appears in the transformed data. Different forecasting methods produced widely different results, indicating that no single method was able to get a strong grasp on the data.

subset.atm.4$sCash %>%
  findfrequency()
## [1] 2
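
The stationarity and the hint of weekly autocorrelation mentioned above can be checked directly; the following is a sketch of that exploration.

# ndiffs() indicates how many differences are needed for stationarity, and
# ggAcf() shows any remaining autocorrelation at the weekly (7-day) lags.
subset.atm.4$sCash %>%
  ndiffs()

subset.atm.4$sCash %>%
  ts(frequency=7, start=1) %>%
  ggAcf()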

I ended up using the stlf function for my forecasts. This function seasonally adjusts the data, applies a non-seasonal forecast to the adjusted series, and then re-seasonalizes the predictions to obtain the final results. This produced a fair compromise among the methods I tried. Because the model is fit on the square-root scale, the forecasts are squared to return them to the original scale. A rough sketch of this workflow is shown below, followed by the actual forecast and the predicted values.
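
The following is only an illustration of the idea behind stlf, under assumed settings (a periodic seasonal window and a non-seasonal ETS forecast); it is not the function’s exact implementation.

# 1. STL-decompose the square-root series, 2. forecast the seasonally adjusted
# series with a non-seasonal ETS model, 3. add the last seasonal cycle back,
# 4. square the result to undo the square-root transformation.
y <- ts(subset.atm.4$sCash, frequency=7)
decomp <- stl(y, s.window='periodic')
sa.fc <- forecast(ets(seasadj(decomp), model='ZZN'), h=31)
seas <- rep(tail(decomp$time.series[, 'seasonal'], 7), length.out=31)
head((sa.fc$mean + seas) ^ 2)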

forecast.atm.4 <- subset.atm.4$sCash %>%
  ts(frequency=7, start=1) %>%
  stlf(31)

forecast.atm.4$mean ** 2
## Time Series:
## Start = c(10, 7) 
## End = c(15, 2) 
## Frequency = 7 
##  [1] 312.5372 374.9730 386.1335  22.8269 579.4351 407.1489 592.3919
##  [8] 312.5372 374.9730 386.1335  22.8269 579.4351 407.1489 592.3919
## [15] 312.5372 374.9730 386.1335  22.8269 579.4351 407.1489 592.3919
## [22] 312.5372 374.9730 386.1335  22.8269 579.4351 407.1489 592.3919
## [29] 312.5372 374.9730 386.1335
forecast.atm.4 %>%
  autoplot() +
  theme_bw() +
  labs(x='Time Series', y='Cash', title='') +
  theme(legend.position='none',
        panel.border = element_blank())

Solutions

I will now plot the historical data alongside the predictions and write the predictions out to a csv file.

predictions <- data_frame(DATE = seq(as.Date('2010-05-01'), by='day', length.out=31),
           ATM1 = forecast.atm.1$mean, 
           ATM2 = forecast.atm.2$mean,
           ATM3 = forecast.atm.3$mean,
           ATM4 = forecast.atm.4$mean**2) %>%
  gather(key=ATM, value=Cash, -DATE) %>%
  mutate(prediction = TRUE)

atm %>%
  mutate(prediction = FALSE) %>%
  rbind(predictions) %>%
  ggplot(aes(DATE, Cash, color=prediction)) +
  geom_point() +
  facet_wrap(~ATM, scales='free') +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x="", y="")

predictions %>%
  select(-prediction) %>%
  write_csv('./question-1-predictions.csv')

Question 2

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

power <- readxl::read_excel('./ResidentialCustomerForecastLoad-624.xlsx') 

power %>%
  summary()
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

First I will remove the CaseSequence column and convert the `YYYY-MMM` column to a proper Date so that it is easier to use.

power <- power %>%
  rename(Date = `YYYY-MMM`) %>%
  select(-CaseSequence) %>%
  mutate(Date = as.Date(paste0('01-', Date), '%d-%Y-%b'))

power %>%
  DT::datatable()

The above summary indicates two issues that need to be addressed, both of which can be seen in the following plot of the data.

power %>%
  ggplot(aes(Date, KWH)) +
  geom_line() +
  theme_bw()

The first issue is the single value whose KWH is an order of magnitude smaller than any other value in the data. Exploration of the surrounding values makes it seem likely that a digit was left off the value. In order to bring this value in line with the rest of the data I will multiply it by 10.

power %>%
  filter(format(Date, '%m') == '07')
## # A tibble: 16 x 2
##    Date            KWH
##    <date>        <dbl>
##  1 1998-07-01  8914755
##  2 1999-07-01  7444416
##  3 2000-07-01  6862662
##  4 2001-07-01  7855129
##  5 2002-07-01  7039702
##  6 2003-07-01  6900676
##  7 2004-07-01  7307931
##  8 2005-07-01  8337998
##  9 2006-07-01  7945564
## 10 2007-07-01  6426220
## 11 2008-07-01  7643987
## 12 2009-07-01  7713260
## 13 2010-07-01   770523
## 14 2011-07-01 10093343
## 15 2012-07-01  8886851
## 16 2013-07-01  8415321

The second issue is the missing value for September 2008. There are a number of ways to impute this data but I have decided to find the average of the previous month and the following month and use that value. It is apparent from the data that KWH is highly seasonal and thus I do not want to disturb this for the single missing value.

power %>%
  slice(128:130)
## # A tibble: 3 x 2
##   Date           KWH
##   <date>       <dbl>
## 1 2008-08-01 8037137
## 2 2008-09-01      NA
## 3 2008-10-01 5101803
power <- power %>%
  mutate(KWH = ifelse(KWH == 770523, KWH * 10, KWH)) %>%
  mutate(KWH = ifelse(is.na(KWH), (8037137+5101803)/2, KWH))

power %>% 
  summary()
##       Date                 KWH          
##  Min.   :1998-01-01   Min.   : 4313019  
##  1st Qu.:2001-12-24   1st Qu.: 5443502  
##  Median :2005-12-16   Median : 6351262  
##  Mean   :2005-12-15   Mean   : 6538942  
##  3rd Qu.:2009-12-08   3rd Qu.: 7649733  
##  Max.   :2013-12-01   Max.   :10655730
power %>%
  ggplot(aes(Date, KWH)) +
  geom_line() +
  theme_bw()

From viewing the data it is apparent that there is a consistent seasonal pattern that we can model. I decided to split the data into a train and test set (holding out the final 32 months) in order to compare a variety of different models.

train <-power$KWH %>%
  ts(frequency=12, start=c(1998, 1)) %>%
  subset(end=160)
test <- power$KWH %>%
  ts(frequency=12, start=c(1998, 1)) %>%
  subset(start=161)

My selected method uses auto.arima with stepwise search and approximation disabled, so that it explores the full parameter space, and with an automatically selected Box-Cox transformation. The residuals for this model indicate a valid fit: the Ljung-Box test shows no significant remaining autocorrelation.

power.arima <- train %>% 
  auto.arima(approximation=FALSE, stepwise=FALSE, lambda='auto') 

power.arima %>%
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,3)(2,1,0)[12] with drift
## Q* = 16.118, df = 18, p-value = 0.5843
## 
## Model df: 6.   Total lags used: 24

Plotting the predictions along with the test data shows a strong fit; a quantitative check on the held-out data follows the plot.

power.arima %>%
  forecast(32, level=c(0)) %>%
  autoplot() +
  autolayer(test)
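
As a quantitative check of the fit (a sketch; the exact error values depend on the model auto.arima selects), accuracy() compares the forecasts against the held-out test set.

power.arima %>%
  forecast(32, level=c(0)) %>%
  accuracy(test)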

With my model selected I will retrain the data on the full dataset and make my forecasts.

predictions <- power$KWH %>%
  ts(frequency=12, start=c(1998, 1)) %>%
  auto.arima(approximation=FALSE, stepwise=FALSE, lambda='auto') %>%
  forecast(12, level=c(0))

predictions %>%
  autoplot() +
  theme_bw() +
  labs(x='', y='KWH')

Finally I write out my predictions.

future.dates <- power$Date %>%
  tail(1) %>%
  seq.Date(by='1 month', length.out=13) %>%
  tail(12)

to.write <- data_frame(CaseSequence=925:936, `YYYY-MMM`=future.dates, KWH=predictions$mean) 

to.write %>%
  write_csv('./question-2-predictions.csv')

to.write %>%
  DT::datatable()

Question 3

Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and the .rmd file, with the forecast in an Excel-readable file.

I begin by merging the two data sets. I will need to also convert the date time to something human readable for ease of processing.

waterflow.1 <- readxl::read_excel('./Waterflow_Pipe1.xlsx')
waterflow.2 <- readxl::read_excel('./Waterflow_Pipe2.xlsx')

waterflow <- rbind(waterflow.1, waterflow.2)

I convert the date-time column and split it into separate Date and Hour columns; since the data needs to be aggregated by hour, having the hour as its own column makes this easier.

waterflow <- waterflow %>%
  mutate(`Date Time` = as.POSIXct(`Date Time`*60*60*24, origin='1899-12-30'),
         Date = lubridate::date(`Date Time`),
         Hour = lubridate::hour(`Date Time`)) %>%
  select(Date, Hour, WaterFlow) %>%
  arrange(Date, Hour)

waterflow %>%
  DT::datatable()

As per the directions, I find the mean for when I have multiple rows representing the same hour.

waterflow <- waterflow %>%
  group_by(Date, Hour) %>%
  summarize(WaterFlow = mean(WaterFlow))

waterflow %>%
  DT::datatable()

There are three hours with no recorded values. I will add them back in, imputing each with the median of all the other values.

missing <- data_frame(Date=c(as.Date('2015-10-27'), as.Date('2015-10-31'), as.Date('2015-10-27')), Hour=c(17, 22, 10), WaterFlow=rep(median(waterflow$WaterFlow), 3))

waterflow <- waterflow %>%
  bind_rows(missing) %>%
  arrange(Date, Hour)

As per the directions, I need to determine whether the data is stationary or needs to be modified to be made stationary.

I examined three different diagnostics, all of which support the conclusion that the data is stationary. ndiffs suggests the data needs to be differenced 0 times. The KPSS test statistic (0.2447) is well below the 5% critical value of 0.463, so we fail to reject stationarity, and ggtsdisplay shows no significant spikes in the ACF or PACF plots. In conclusion, the data is stationary.

waterflow$WaterFlow %>%
  ndiffs()
## [1] 0
waterflow$WaterFlow %>%
  ggtsdisplay()

waterflow$WaterFlow %>%
  ts(frequency=24, start=c(1, 17)) %>%
  urca::ur.kpss()
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.2447

Although the data is stationary, that does not mean it will be easy to predict. I created a number of different plots in order to determine whether there was some exploitable pattern in the data that could be used to model it. As seen in the plots below, there is no discernible pattern when considering hour of day or day of week.

waterflow %>%
  ungroup() %>%
  mutate(dow = as.factor(as.POSIXlt(Date)$wday)) %>%
  ggplot(aes(dow, WaterFlow)) +
  geom_boxplot() + 
  geom_point(alpha=0.2) +
  theme_bw() +
  scale_x_discrete(labels=c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')) +
  labs(x='Day of Week') +
  theme(panel.border = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank())

waterflow %>%
  ggplot(aes(Hour, WaterFlow, color=as.factor(Date))) +
  geom_point() +
  geom_line(aes(group=Date)) +
  labs(color='Date') +
  theme_bw() +
  theme(legend.background = element_rect(color='black'))

With no discernible, exploitable pattern and the need to make predictions far into the future relative to the size of the training data, I have decided to use a damped trend. Damping pulls the forecasts back toward a stable level instead of extrapolating the trend indefinitely into the future; a brief comparison after the model fit below illustrates the effect.

fit <- waterflow$WaterFlow %>%
  ts(frequency=24, start=17) %>%
  hw(h=168, damped=TRUE, level=c(80)) 
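
To illustrate the effect of damping (a sketch, not part of the original workflow), the tail of the damped forecast can be compared with an undamped Holt-Winters forecast over the same horizon.

fit.undamped <- waterflow$WaterFlow %>%
  ts(frequency=24, start=17) %>%
  hw(h=168, damped=FALSE, level=c(80))

tail(fit$mean)            # damped forecasts settle toward a stable level
tail(fit.undamped$mean)   # undamped forecasts keep extrapolating the trend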

fit %>%
  autoplot() +
  labs(x='', y='Waterflow') +
  theme_bw() +
  theme(legend.position='none')

Finally, I will write out the data to a csv file.

# Build hourly timestamps for the 168-hour forecast horizon and convert them
# back to Excel serial day numbers (days since 1899-12-30) for the output file.
origin <- as.POSIXct('1899-12-30', format='%Y-%m-%d')
from <- waterflow %>%
  tail(1) %>%
  mutate(dt = paste0(Date, 'T', Hour)) %>%
  .$dt %>%
  as.POSIXct(format="%Y-%m-%dT%H") + 9*60*60

dt <- seq.POSIXt(from, by="hour", length.out=168) %>%
  difftime(origin, units=c('days')) %>%
  as.numeric()
predictions <- data_frame(`Date Time` = dt, WaterFlow = fit$mean)

predictions %>%
  write_csv('./question-3-predictions.csv')

predictions %>%
  DT::datatable()