I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you the data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
atm <- readxl::read_excel('./ATM624Data.xlsx') %>%
mutate(DATE = as.Date(DATE, origin='1899-12-30'))
atm %>%
summary()
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
There are 19 missing Cash values. Examining them reveals 14 completely empty rows (missing both ATM and Cash) and 5 rows missing only the Cash value. The 14 empty rows are extraneous and can be removed; the 5 remaining cases will need to be imputed.
atm %>%
filter(!complete.cases(.)) %>%
DT::datatable()
I decided to impute the 5 missing values using the median of the corresponding ATM’s remaining values. The exploration below shows that the ATMs have significantly different distributions, so I felt it was better to impute from only the matching ATM’s values.
atm <- atm %>%
filter(!(is.na(.$ATM) & is.na(.$Cash)))
medians <- atm %>%
filter(!is.na(Cash)) %>%
group_by(ATM) %>%
summarise(med = median(Cash))
atm[is.na(atm$Cash) & atm$ATM == 'ATM1', ]$Cash <- medians$med[medians$ATM == 'ATM1'][1]
atm[is.na(atm$Cash) & atm$ATM == 'ATM2', ]$Cash <- medians$med[medians$ATM == 'ATM2'][1]
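For reference, the same imputation can be written without hard-coding the ATM names; the sketch below (using a hypothetical atm.imputed object) fills every missing Cash value with the median of its own ATM:
# Sketch: per-ATM median imputation without hard-coding ATM names
atm.imputed <- atm %>%
  group_by(ATM) %>%
  mutate(Cash = ifelse(is.na(Cash), median(Cash, na.rm = TRUE), Cash)) %>%
  ungroup()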
Examining the largest values reveals one value that is an order of magnitude larger than anything else in the data set. This value is either a mistake (perhaps a typo) or a valid value that is wholly unrepresentative of typical ATM usage; perhaps an extraordinary event occurred that day. I will adjust it by setting it to the next-largest value from the same ATM, which keeps the modeling from being distracted by an obvious outlier.
atm %>%
arrange(desc(Cash)) %>%
head(10)
## # A tibble: 10 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
## 2 2009-09-22 ATM4 1712.
## 3 2010-01-29 ATM4 1575.
## 4 2009-09-04 ATM4 1495.
## 5 2009-06-09 ATM4 1484.
## 6 2009-09-05 ATM4 1301.
## 7 2010-02-28 ATM4 1276.
## 8 2009-08-23 ATM4 1246.
## 9 2009-06-14 ATM4 1221.
## 10 2009-09-29 ATM4 1195.
n <- atm$Cash %>%
na.omit() %>%
length()
temp <- atm$Cash %>%
na.omit() %>%
sort(partial=c(n-1, n))
largest.2 <- temp[n-1]
largest <- temp[n]
atm <- atm %>%
mutate(Cash = ifelse(Cash == largest, largest.2, Cash))
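For the record, the replacement above swaps the single largest value in the whole data set for the overall second-largest; because both belong to ATM4, that matches the stated intent of staying within the same ATM. A version that makes the within-ATM logic explicit (a sketch applied to the data before the replacement above, using a hypothetical atm.alt object) would be:
# Sketch: cap ATM4's largest withdrawal at ATM4's own second-largest value
# (assumes the data prior to the global replacement performed above)
atm4.sorted <- sort(atm$Cash[atm$ATM == 'ATM4'], decreasing = TRUE)
atm.alt <- atm %>%
  mutate(Cash = ifelse(ATM == 'ATM4' & Cash == atm4.sorted[1],
                       atm4.sorted[2], Cash))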
The data is now cleaned and ready for analysis and modelling.
atm %>%
summary()
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 1
## Median :2009-10-30 Mode :character Median : 73
## Mean :2009-10-30 Mean : 149
## 3rd Qu.:2010-01-29 3rd Qu.: 114
## Max. :2010-04-30 Max. :1712
The plot below shows that each of the ATMs has a very different distribution. ATM1 appears to have cyclical withdrawals that are not captured by the smoothed trend line. ATM2 and ATM4 appear to have more consistent withdrawal values. Finally, ATM3 appears to have seen no activity until late April 2010.
Given how different the distributions for the 4 ATMs are, I will split the data and build a separate model for each ATM.
atm %>%
ggplot(aes(DATE, Cash, color=ATM)) +
geom_point() +
geom_smooth() +
facet_wrap(~ATM, scales='free') +
theme_bw() +
theme(legend.position = 'none') +
labs(x="", y="")
atm.1 <- atm %>%
filter(ATM == 'ATM1')
atm.2 <- atm %>%
filter(ATM == 'ATM2')
atm.3 <- atm %>%
filter(ATM == 'ATM3')
atm.4 <- atm %>%
filter(ATM == 'ATM4')
As noted previously, there appears to be some cyclicity to the cash withdrawals at ATM 1. The pattern becomes clear when faceting the data by day of the week. Sunday, Monday, Wednesday, Friday and Saturday are remarkably consistent across the year. Tuesday and Thursday, on the other hand, are a bit more difficult to unravel.
For most of the year Thursday’s withdrawals were very low, until they jumped to join the other days of the week around March. At the same time, Tuesday’s withdrawals plummeted to roughly the levels previously seen on Thursday. I cannot speculate as to why this happened, but the new pattern has persisted long enough that I believe it will hold through at least the end of the forecast period.
atm.1 <- atm.1 %>%
mutate(day_of_week = factor(format(DATE, "%a"), levels=c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')))
atm.1 %>%
ggplot(aes(DATE, Cash)) +
geom_point() +
facet_wrap(~day_of_week) +
theme_bw() +
labs(x="", y="")
Using the entirety of the data set to make predictions seems unwise in this case. Not only does the older data appear unrepresentative of current behavior, it would also be difficult for any decomposition function to discern this change in seasonality. For this reason, I am going to subset the data and model only the more recent observations. The new pattern began the week of 02/21/2010.
The risk in doing this is that there may be a yearly seasonality in which March and April withdrawal rates change and then revert; there simply is not enough data to rule that out. With that said, such a sizeable change over such a short period, in the absence of any known underlying reason, suggests to me that this is a relatively permanent shift.
The subsetted data shows a much clearer pattern to model.
subset.atm.1 <- atm.1 %>%
filter(DATE >= as.Date('2010-02-21'))
subset.atm.1 %>%
ggplot(aes(DATE, Cash)) +
geom_point(aes(color=day_of_week)) +
geom_line(aes(alpha=0.2), show.legend=FALSE) +
labs(x="", y="", color='') +
scale_color_brewer(palette = 'Dark2') +
scale_y_continuous(limit=c(0, 180)) +
theme_bw() +
theme(legend.background = element_rect(fill='white', color='black'),
legend.key = element_rect(fill='grey80'),
legend.position = c(0.7, 0.88),
legend.direction = 'horizontal')
forecast.atm.1 <- subset.atm.1$Cash %>%
ts(frequency=7, start=1) %>%
hw(h=31, damped=TRUE, level=c(80))
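Before reading off the point forecasts, a quick residual diagnostic and in-sample accuracy check is a reasonable sanity step (a sketch; the output is not reproduced here):
# Sketch: Ljung-Box test, residual ACF and in-sample error measures for the damped Holt-Winters fit
checkresiduals(forecast.atm.1)
accuracy(forecast.atm.1)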
The predicted May values are:
forecast.atm.1$mean
## Time Series:
## Start = c(10, 7)
## End = c(15, 2)
## Frequency = 7
## [1] 84.1601124 106.0858919 66.4953373 0.8646835 98.8164126
## [6] 83.1075115 84.6762672 83.9575492 105.8928075 66.3112881
## [11] 0.6892467 98.6491853 82.9481095 84.5243242 83.8127163
## [16] 105.7547520 66.1796928 0.5638094 98.5296177 82.8341370
## [21] 84.4156849 83.7091608 105.6560422 66.0856021 0.4741215
## [26] 98.4441267 82.7526465 84.3380078 83.6351184 105.5854646
## [31] 66.0183271
The plotted predictions appear reasonable:
forecast.atm.1 %>%
autoplot() +
theme_bw() +
labs(x='Time Series', y='Cash', title='') +
theme(legend.position='none',
panel.border = element_blank())
Similar to ATM 1, the distribution of withdrawals from ATM 2 appears to change suddenly over the final two months. Specifically, Monday, Tuesday, Wednesday and Thursday see fundamental shifts in their values; Sunday and Friday also change, although to a lesser extent. To capture this more recent behavior, I will once again subset the recent data.
atm.2 <- atm.2 %>%
mutate(day_of_week = factor(format(DATE, "%a"), levels=c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')))
atm.2 %>%
ggplot(aes(DATE, Cash)) +
geom_point() +
facet_wrap(~day_of_week) +
theme_bw() +
labs(x="", y="")
The subsetted data again shows a much clearer pattern to model.
subset.atm.2 <- atm.2 %>%
filter(DATE >= as.Date('2010-02-21'))
subset.atm.2 %>%
ggplot(aes(DATE, Cash)) +
geom_point(aes(color=day_of_week)) +
geom_line(aes(alpha=0.2), show.legend=FALSE) +
labs(x="", y="", color='') +
scale_color_brewer(palette = 'Dark2') +
scale_y_continuous(limit=c(0, 150)) +
theme_bw() +
theme(legend.background = element_rect(fill='white', color='black'),
legend.key = element_rect(fill='grey80'),
legend.position = c(0.7, 0.88),
legend.direction = 'horizontal')
forecast.atm.2 <- subset.atm.2$Cash %>%
ts(frequency=7, start=1) %>%
hw(h=31, damped=TRUE, level=c(80))
The predicted May values are:
forecast.atm.2$mean
## Time Series:
## Start = c(10, 7)
## End = c(15, 2)
## Frequency = 7
## [1] 69.337179 82.683535 14.448620 5.081073 105.458137 94.948565
## [7] 53.345016 69.820002 83.155757 14.910473 5.532785 105.899931
## [13] 95.380659 53.767623 70.233329 83.560009 15.305850 5.919481
## [19] 106.278136 95.750559 54.129401 70.587164 83.906075 15.644318
## [25] 6.250517 106.601903 96.067218 54.439107 70.890070 84.202331
## [31] 15.934068
The plotted predictions appear reasonable:
forecast.atm.2 %>%
autoplot() +
theme_bw() +
labs(x='Time Series', y='Cash', title='') +
theme(legend.position='none',
panel.border = element_blank())
ATM 3 is simultaneously the easiest and the most challenging forecast to make. The plot shows only three non-zero values. It is unclear why all of the earlier values are 0, but it seems safe to assume the ATM simply was not open and usable before late April.
atm.3 %>%
ggplot(aes(DATE, Cash)) +
geom_point() +
theme_bw() +
labs(x='', y='')
Three values is simply not enough to make any reasonable prediction. There are really only two choices at this stage. The first would be to inform my boss that the lack of data makes any model for this ATM essentially unreliable, and that I would not feel comfortable making decisions based on a model built from so few points. Of course, this may not be an available option.
Thus, the second choice would be to find the ATM that is most similar to this one and use those values as my predictions. Given a more robust data set, we could select ATMs that are similar based on previously established relationships. With this data set however, we have limited information to go on.
Comparing ATM 3’s only three non-zero values to the other ATMs’ values on those same dates makes it clear that ATM 1 is the most similar to ATM 3. In fact, the values are identical!
atm %>%
filter(DATE >= as.Date('2010-04-28')) %>%
spread(key='ATM', value='Cash') %>%
DT::datatable()
Thus, I will use ATM 1 to model ATM 3 and reuse the predictions made for ATM 1 for ATM 3. It should be noted that ATM 2 is also incredibly similar, value-wise, to ATM 3, and an argument could be made for using the median of ATM 1’s and ATM 2’s values. However, with as little data as we have for ATM 3, I believe the choice is largely arbitrary.
I believe with just a bit more data we can determine which model would be best for ATM 3 and thus would suggest revisiting the ATM in a week or two.
forecast.atm.3 <- forecast.atm.1
forecast.atm.3$mean
## Time Series:
## Start = c(10, 7)
## End = c(15, 2)
## Frequency = 7
## [1] 84.1601124 106.0858919 66.4953373 0.8646835 98.8164126
## [6] 83.1075115 84.6762672 83.9575492 105.8928075 66.3112881
## [11] 0.6892467 98.6491853 82.9481095 84.5243242 83.8127163
## [16] 105.7547520 66.1796928 0.5638094 98.5296177 82.8341370
## [21] 84.4156849 83.7091608 105.6560422 66.0856021 0.4741215
## [26] 98.4441267 82.7526465 84.3380078 83.6351184 105.5854646
## [31] 66.0183271
Once again, there is a fundamental change in ATM usage over the final two months. As with ATM 1 and ATM 2, I will subset the data in order to better identify the recent trend.
atm.4 <- atm.4 %>%
mutate(day_of_week = factor(format(DATE, "%a"), levels=c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')))
atm.4 %>%
ggplot(aes(DATE, Cash)) +
geom_point() +
facet_wrap(~day_of_week) +
theme_bw() +
labs(x="", y="")
This data is easier to interpret; however, it also appears to be heteroscedastic: the variation early in the subset is larger than later on. This can cause difficulty for the decomposition, so I will apply a Box-Cox transformation to address the issue.
subset.atm.4 <- atm.4 %>%
filter(DATE >= as.Date('2010-02-21'))
subset.atm.4 %>%
ggplot(aes(DATE, Cash)) +
geom_point(aes(color=day_of_week)) +
geom_line(aes(alpha=0.2), show.legend=FALSE) +
labs(x="", y="", color='') +
scale_color_brewer(palette = 'Dark2') +
scale_y_continuous(limit=c(0, 1300)) +
theme_bw() +
theme(legend.background = element_rect(fill='white', color='black'),
legend.key = element_rect(fill='grey80'),
legend.position = c(0.7, 0.88),
legend.direction = 'horizontal')
The estimated Box-Cox lambda is approximately 0.5, which is equivalent (up to a linear rescaling) to taking the square root of the variable, so I will apply the square root manually. The transformation appears to have addressed the heteroscedasticity of the data.
BoxCox.lambda(subset.atm.4$Cash)
## [1] 0.5064559
subset.atm.4 <- subset.atm.4 %>%
mutate(sCash = sqrt(Cash))
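As a quick check (a sketch), the lambda = 0.5 Box-Cox transform, (y^0.5 - 1)/0.5 = 2*(sqrt(y) - 1), is an affine function of the square root, so the two transformations stabilize the variance identically:
# Sketch: BoxCox(y, 0.5) differs from sqrt(y) only by a linear rescaling
all.equal(as.numeric(BoxCox(subset.atm.4$Cash, lambda = 0.5)),
          2 * (sqrt(subset.atm.4$Cash) - 1))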
subset.atm.4 %>%
ggplot(aes(DATE, sCash)) +
geom_point(aes(color=day_of_week)) +
geom_line(aes(alpha=0.2), show.legend=FALSE) +
labs(x="", y="", color='') +
scale_color_brewer(palette = 'Dark2') +
scale_y_continuous(limit=c(0, 40)) +
theme_bw() +
theme(legend.background = element_rect(fill='white', color='black'),
legend.key = element_rect(fill='grey80'),
legend.position = c(0.7, 0.88),
legend.direction = 'horizontal')
Making predictions for this data set proved to be quite difficult. Exploring the data revealed that it is stationary. Some exploratory plots indicated weekly autocorrelation, but findfrequency suggested a seasonal period of only 2 days, and that frequency only appears in the transformed data. Different forecasting methods produced widely different results, indicating that no single method had a strong grasp on the data.
subset.atm.4$sCash %>%
findfrequency()
## [1] 2
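To illustrate how much the candidate methods disagreed, a small holdout comparison along the following lines could be run (a sketch with hypothetical object names and an illustrative one-week holdout, not the exact experiment behind the statement above):
# Sketch: hold out the final week of the subset and compare test-set RMSE across methods
full.4  <- ts(subset.atm.4$sCash, frequency = 7, start = 1)
train.4 <- subset(full.4, end = length(full.4) - 7)
test.4  <- subset(full.4, start = length(full.4) - 6)
rbind(
  hw    = accuracy(hw(train.4, h = 7, damped = TRUE), test.4)["Test set", "RMSE"],
  stlf  = accuracy(stlf(train.4, h = 7), test.4)["Test set", "RMSE"],
  arima = accuracy(forecast(auto.arima(train.4), h = 7), test.4)["Test set", "RMSE"]
)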
I ended up using the stlf function for my forecasts. This function seasonally adjusts the data and then applies a non-seasonal forecast to the adjusted series; the final results are obtained by re-seasonalizing the predictions. This produced a fair compromise among the approaches I tried. The predicted values (back-transformed by squaring) are:
forecast.atm.4 <- subset.atm.4$sCash %>%
ts(frequency=7, start=1) %>%
stlf(31)
forecast.atm.4$mean ^ 2
## Time Series:
## Start = c(10, 7)
## End = c(15, 2)
## Frequency = 7
## [1] 312.5372 374.9730 386.1335 22.8269 579.4351 407.1489 592.3919
## [8] 312.5372 374.9730 386.1335 22.8269 579.4351 407.1489 592.3919
## [15] 312.5372 374.9730 386.1335 22.8269 579.4351 407.1489 592.3919
## [22] 312.5372 374.9730 386.1335 22.8269 579.4351 407.1489 592.3919
## [29] 312.5372 374.9730 386.1335
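For reference, stlf’s approach can be written out by hand (a sketch of the idea with hypothetical object names, not a reproduction of stlf’s internals): decompose with STL, forecast the seasonally adjusted series with a non-seasonal ETS model, then add the most recent seasonal cycle back on.
# Sketch: the steps behind stlf, written out explicitly
y <- ts(subset.atm.4$sCash, frequency = 7, start = 1)
fit.stl   <- stl(y, s.window = 'periodic')                  # STL decomposition
adj       <- seasadj(fit.stl)                               # seasonally adjusted series
fc.adj    <- forecast(ets(adj, model = 'ZZN'), h = 31)      # non-seasonal ETS forecast
seas.last <- tail(fit.stl$time.series[, 'seasonal'], 7)     # last observed seasonal week
fc.mean   <- fc.adj$mean + rep(seas.last, length.out = 31)  # re-seasonalized point forecasts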
forecast.atm.4 %>%
autoplot() +
theme_bw() +
labs(x='Time Series', y='Cash', title='') +
theme(legend.position='none',
panel.border = element_blank())
I will now plot all of the historical and predicted data and write the predictions out to a csv file.
predictions <- data_frame(DATE = seq(as.Date('2010-05-01'), by='day', length.out=31),
ATM1 = forecast.atm.1$mean,
ATM2 = forecast.atm.2$mean,
ATM3 = forecast.atm.3$mean,
ATM4 = forecast.atm.4$mean^2) %>%
gather(key=ATM, value=Cash, -DATE) %>%
mutate(prediction = TRUE)
atm %>%
mutate(prediction = FALSE) %>%
rbind(predictions) %>%
ggplot(aes(DATE, Cash, color=prediction)) +
geom_point() +
facet_wrap(~ATM, scales='free') +
theme_bw() +
theme(legend.position = 'none') +
labs(x="", y="")
predictions %>%
select(-prediction) %>%
write_csv('./question-1-predictions.csv')
Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
power <- readxl::read_excel('./ResidentialCustomerForecastLoad-624.xlsx')
power %>%
summary()
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
First I will remove the CaseSequence column and convert the Date column so that it is easier to use.
power <- power %>%
rename(Date = `YYYY-MMM`) %>%
select(-CaseSequence) %>%
mutate(Date = as.Date(paste0('01-', Date), '%d-%Y-%b'))
power %>%
DT::datatable()
The above summary indicates two issues that need to be addressed, both of which can be seen in the following plot of the data.
power %>%
ggplot(aes(Date, KWH)) +
geom_line() +
theme_bw()
The first issue is the single value whose KWH is an order of magnitude smaller than any other value in the data. Examining the surrounding values makes it seem likely that a digit was dropped from this entry. To bring it in line with the rest of the data I will multiply it by 10.
power %>%
filter(format(Date, '%m') == '07')
## # A tibble: 16 x 2
## Date KWH
## <date> <dbl>
## 1 1998-07-01 8914755
## 2 1999-07-01 7444416
## 3 2000-07-01 6862662
## 4 2001-07-01 7855129
## 5 2002-07-01 7039702
## 6 2003-07-01 6900676
## 7 2004-07-01 7307931
## 8 2005-07-01 8337998
## 9 2006-07-01 7945564
## 10 2007-07-01 6426220
## 11 2008-07-01 7643987
## 12 2009-07-01 7713260
## 13 2010-07-01 770523
## 14 2011-07-01 10093343
## 15 2012-07-01 8886851
## 16 2013-07-01 8415321
The second issue is the missing value for September 2008. There are a number of ways to impute this data but I have decided to find the average of the previous month and the following month and use that value. It is apparent from the data that KWH is highly seasonal and thus I do not want to disturb this for the single missing value.
power %>%
slice(128:130)
## # A tibble: 3 x 2
## Date KWH
## <date> <dbl>
## 1 2008-08-01 8037137
## 2 2008-09-01 NA
## 3 2008-10-01 5101803
power <- power %>%
mutate(KWH = ifelse(KWH == 770523, KWH * 10, KWH)) %>%
mutate(KWH = ifelse(is.na(KWH), (8037137+5101803)/2, KWH))
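An alternative worth noting (a sketch only, not what was used above) is forecast::na.interp, which performs a seasonally aware, STL-based interpolation and would give a very similar value for the missing month:
# Sketch: seasonally aware imputation of the missing KWH value
# (assumes the September 2008 NA has not yet been filled by the manual fix above;
#  run on the already-cleaned series it simply returns the data unchanged)
power.ts <- ts(power$KWH, frequency = 12, start = c(1998, 1))
power.ts <- na.interp(power.ts)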
power %>%
summary()
## Date KWH
## Min. :1998-01-01 Min. : 4313019
## 1st Qu.:2001-12-24 1st Qu.: 5443502
## Median :2005-12-16 Median : 6351262
## Mean :2005-12-15 Mean : 6538942
## 3rd Qu.:2009-12-08 3rd Qu.: 7649733
## Max. :2013-12-01 Max. :10655730
power %>%
ggplot(aes(Date, KWH)) +
geom_line() +
theme_bw()
From viewing the data it is apparent that there is a consistent seasonal pattern that we can model. I decided to split the data into a train and test set in order to examine a variety of different models.
train <-power$KWH %>%
ts(frequency=12, start=c(1998, 1)) %>%
subset(end=160)
test <- power$KWH %>%
ts(frequency=12, start=c(1998, 1)) %>%
subset(start=161)
My selected method uses auto.arima with the full, non-stepwise search of the parameter space and an automatically chosen Box-Cox lambda. The residuals for this model indicate a valid fit.
power.arima <- train %>%
auto.arima(approximation=FALSE, stepwise=FALSE, lambda='auto')
power.arima %>%
checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,3)(2,1,0)[12] with drift
## Q* = 16.118, df = 18, p-value = 0.5843
##
## Model df: 6. Total lags used: 24
Plotting the predictions along with the test data shows a strong fit.
power.arima %>%
forecast(32, level=c(0)) %>%
autoplot() +
autolayer(test)
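To back up the visual impression with numbers, the forecast can be scored against the held-out observations (a sketch; the resulting error table is not reproduced here):
# Sketch: test-set error measures for the ARIMA forecast over the 32-month holdout
power.arima %>%
  forecast(32) %>%
  accuracy(test)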
With my model selected, I will retrain on the full dataset and make my forecasts. Note that the series must again be converted to a monthly ts object so that auto.arima can see the seasonal structure.
predictions <- power$KWH %>%
ts(frequency=12, start=c(1998, 1)) %>%
auto.arima(approximation=FALSE, stepwise=FALSE, lambda='auto') %>%
forecast(12, level=c(0))
predictions %>%
autoplot() +
theme_bw() +
labs(x='', y='KWH')
Finally I write out my predictions.
future.dates <- power$Date %>%
tail(1) %>%
seq.Date(length=13, by='1 month') %>%
tail(12)
to.write <- data_frame(CaseSequence=925:936, `YYYY-MMM`=future.dates, KWH=predictions$mean)
to.write %>%
write_csv('./question-2-predictions.csv')
to.write %>%
DT::datatable()
Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
I begin by merging the two data sets. I will also need to convert the date-time values to something human-readable for ease of processing.
waterflow.1 <- readxl::read_excel('./Waterflow_Pipe1.xlsx')
waterflow.2 <- readxl::read_excel('./Waterflow_Pipe2.xlsx')
waterflow <- rbind(waterflow.1, waterflow.2)
I convert the date-time column and separate it into two columns. Since the data needs to be aggregated by hour, it is easier to break the hour out into its own column.
waterflow <- waterflow %>%
mutate(`Date Time` = as.POSIXct(`Date Time`*60*60*24, origin='1899-12-30'),
Date = lubridate::date(`Date Time`),
Hour = lubridate::hour(`Date Time`)) %>%
select(Date, Hour, WaterFlow) %>%
arrange(Date, Hour)
waterflow %>%
DT::datatable()
As per the directions, I find the mean for when I have multiple rows representing the same hour.
waterflow <- waterflow %>%
group_by(Date, Hour) %>%
summarize(WaterFlow = mean(WaterFlow))
waterflow %>%
DT::datatable()
There are three hours that have missing values. I will impute them with the median of all the other values.
missing <- data_frame(Date=c(as.Date('2015-10-27'), as.Date('2015-10-31'), as.Date('2015-10-27')), Hour=c(17, 22, 10), WaterFlow=rep(median(waterflow$WaterFlow), 3))
waterflow <- waterflow %>%
bind_rows(missing) %>%
arrange(Date, Hour)
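The three missing slots above were identified and inserted by hand; a more general way to do the same thing (a sketch using a hypothetical waterflow.alt object) is to expand the Date/Hour grid and fill whatever is missing:
# Sketch: one row for every observed Date crossed with every hour, gaps filled with the median
# (note this also pads the partial first and last days, which would need to be trimmed back)
waterflow.alt <- waterflow %>%
  ungroup() %>%
  tidyr::complete(Date, Hour = 0:23) %>%
  mutate(WaterFlow = ifelse(is.na(WaterFlow), median(WaterFlow, na.rm = TRUE), WaterFlow))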
As per the directions, I need to determine whether the data is stationary or needs to be modified to be made stationary.
I ran three different checks, all of which support the conclusion that the data is stationary. ndiffs suggests the data needs to be differenced 0 times, the KPSS test statistic (0.245) is well below the 5% critical value of 0.463 so we fail to reject the null hypothesis of stationarity, and ggtsdisplay shows no significant spikes in the ACF or PACF plots. In conclusion, the data is stationary.
waterflow$WaterFlow %>%
ndiffs()
## [1] 0
waterflow$WaterFlow %>%
ggtsdisplay()
waterflow$WaterFlow %>%
ts(frequency=24, start=c(1, 17)) %>%
urca::ur.kpss()
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.2447
Although the data is stationary, that does not mean it will be easy to predict. I created a number of different plots to determine whether there was some exploitable pattern that could be used to model the series. As seen in the plots below, there is no discernible pattern by hour of day or day of week.
waterflow %>%
ungroup() %>%
mutate(dow = as.factor(as.POSIXlt(Date)$wday)) %>%
ggplot(aes(dow, WaterFlow)) +
geom_boxplot() +
geom_point(alpha=0.2) +
theme_bw() +
scale_x_discrete(labels=c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')) +
labs(x='Day of Week') +
theme(panel.border = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.x = element_blank())
waterflow %>%
ggplot(aes(Hour, WaterFlow, color=as.factor(Date))) +
geom_point() +
geom_line(aes(group=Date)) +
labs(color='Date') +
theme_bw() +
theme(legend.background = element_rect(color='black'))
With no discernible, exploitable pattern and the need to forecast far into the future relative to the size of the training data, I have decided to use a damped trend. Damping pulls the forecast back towards a stable level instead of extrapolating a trend indefinitely into the future.
fit <- waterflow$WaterFlow %>%
ts(frequency=24, start=c(1, 17)) %>%
hw(h=168, damped=TRUE, level=c(80))
fit %>%
autoplot() +
labs(x='', y='Waterflow') +
theme_bw() +
theme(legend.position='none')
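Given the lack of structure, the damped forecast settles quickly toward the series mean. A rough holdout comparison against a plain mean forecast (a sketch with hypothetical object names and an illustrative one-week holdout) is one way to check whether the weekly and daily structure in hw adds anything over a flat mean:
# Sketch: compare damped Holt-Winters with a simple mean forecast on a one-week holdout
wf.ts    <- ts(waterflow$WaterFlow, frequency = 24, start = c(1, 17))
wf.train <- subset(wf.ts, end = length(wf.ts) - 168)
wf.test  <- subset(wf.ts, start = length(wf.ts) - 167)
accuracy(hw(wf.train, h = 168, damped = TRUE), wf.test)["Test set", "RMSE"]
accuracy(meanf(wf.train, h = 168), wf.test)["Test set", "RMSE"]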
Finally, I will write out the data to a csv file.
origin <- as.POSIXct('1899-12-30', format='%Y-%m-%d')
from <- waterflow %>%
tail(1) %>%
mutate(dt = paste0(Date, 'T', Hour)) %>%
.$dt %>%
as.POSIXct(format="%Y-%m-%dT%H") + 9*60*60
dt <- seq.POSIXt(from, by="hour", length.out=168) %>%
difftime(origin, units=c('days')) %>%
as.numeric()
predictions <- data_frame(`Date Time` = dt, WaterFlow = fit$mean)
predictions %>%
write_csv('./question-3-predictions.csv')
predictions %>%
DT::datatable()