library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readxl)
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(fpp3)
## -- Attaching packages -------------------------------------------- fpp3 0.4.0 --
## v lubridate   1.7.10         v feasts      0.2.2     
## v tsibble     1.0.1.9999     v fable       0.3.1     
## v tsibbledata 0.3.0.9000
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date()    masks base::date()
## x mice::filter()       masks dplyr::filter(), stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(openxlsx)

Project 1

Part A ATM

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.

# Read the raw file once, keeping rows with missing ATM/Cash values for
# inspection, then drop those rows for the working dataset.
hasna <- read_xlsx("./ATM624Data.xlsx") %>%
  mutate(ATM = as.integer(str_remove(ATM, "ATM")),
         DATE = lubridate::date(DATE)) %>%
  as_tsibble(key = ATM, index = DATE)
atmData <- hasna %>% drop_na()
atmData %>% autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`

atmData %>% filter(ATM == 3) %>% autoplot(vars(Cash)) + labs(title = "ATM 3")

Pulling the data in, we see an extreme outlier on ATM 4, though not so extreme as to be outside the realm of possibility. Still, we will have to patch it out with the mean and ask SMEs to track down its cause, to see whether it can be replicated and predicted separately; left in place it is destructive to the models. It is also immediately obvious that the ATMs have vastly different daily use, with ATM 3 seeing almost no use for much of the period. Finally, we drop a short run of rows with missing ATM attribution and missing Cash values.
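
As a quick sanity check (a minimal sketch using the otherwise-unused hasna copy, which retains the dropped rows), we can summarize exactly what drop_na removed:

hasna %>%
  as_tibble() %>%
  filter(is.na(ATM) | is.na(Cash)) %>%   # the rows absent from atmData
  summarise(n = n(), first = min(DATE), last = max(DATE))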

Action item: investigate the February 10th outlier on ATM 4 to determine its cause and when, if ever, it would repeat.

# Blank out the single extreme value (the ATM 4 outlier), then replace it
# with the mean of the remaining observations (pooled across all ATMs).
atmData <- atmData %>%
  mutate(Cash = na_if(Cash, max(atmData$Cash))) %>%
  mutate(Cash = case_when(is.na(Cash) ~ mean(Cash, na.rm = TRUE),
                          TRUE ~ as.numeric(Cash)))

While there does not seem to be much seasonality at the annual level, it is unsurprising that there is a day-of-week effect. Friday is apparently not a popular day to take out money.

atmData %>%
  as_tibble() %>%
  group_by(wday = lubridate::wday(DATE), ATM = as.character(ATM)) %>%
  summarize(Cash = mean(Cash)) %>%
  ggplot(aes(x = wday, y = Cash, fill = ATM)) +
  geom_bar(position = "dodge", stat = "identity")
## `summarise()` has grouped output by 'wday'. You can override using the `.groups` argument.

We should also check for a day-of-month effect, though it is not as strong as the day-of-week effect.

atmData %>%
  as_tibble() %>%
  group_by(mday = lubridate::mday(DATE), ATM = as.character(ATM)) %>%
  summarize(Cash = mean(Cash)) %>%
  ggplot(aes(x = mday, y = Cash, fill = ATM)) +
  geom_bar(position = "dodge", stat = "identity")
## `summarise()` has grouped output by 'mday'. You can override using the `.groups` argument.

We do not have a long enough time span to draw definite takeaways about behavior across the year, but there is some month-to-month variation, and interestingly it does not include any holiday-season increase.

atmData %>%
  as_tibble() %>%
  group_by(month = lubridate::month(DATE), ATM = as.character(ATM)) %>%
  summarize(Cash = mean(Cash)) %>%
  ggplot(aes(x = month, y = Cash, fill = ATM)) +
  geom_bar(position = "dodge", stat = "identity")
## `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.

Perhaps unsurprisingly, the balance of the evidence suggests fitting a separate model per ATM rather than trying to generalize across machines.

# Fit an ETS model per ATM, filling calendar gaps as zero-cash days.
# ATM 3 is deliberately excluded; see the discussion below.
atm1Model <- atmData %>% filter(ATM == 1) %>% fill_gaps(.full = TRUE) %>% replace_na(list(Cash = 0)) %>% model(ETS(Cash))
atm2Model <- atmData %>% filter(ATM == 2) %>% fill_gaps(.full = TRUE) %>% replace_na(list(Cash = 0)) %>% model(ETS(Cash))
atm4Model <- atmData %>% filter(ATM == 4) %>% fill_gaps(.full = TRUE) %>% replace_na(list(Cash = 0)) %>% model(ETS(Cash))
components(atm1Model) %>%  autoplot() 
## Warning: Removed 7 row(s) containing missing values (geom_path).

components(atm2Model) %>%  autoplot() 
## Warning: Removed 7 row(s) containing missing values (geom_path).

components(atm4Model) %>%  autoplot() 
## Warning: Removed 7 row(s) containing missing values (geom_path).
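
Before forecasting, we can confirm which ETS specification was automatically selected (a quick check; the printed report is not reproduced here). For ATM 1, for example:

atm1Model %>% report()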

atm1forecast <- atm1Model %>% forecast(h = "1 month")
atm1forecast %>% autoplot(atmData %>% filter(ATM == 1))

atm2forecast <- atm2Model %>% forecast(h = "1 month")
atm2forecast %>% autoplot(atmData %>% filter(ATM == 2))

atm4forecast <- atm4Model %>% forecast(h = "1 month")
atm4forecast %>% autoplot(atmData %>% filter(ATM == 4))

bind_rows(atm1forecast, atm2forecast, atm4forecast) %>%
  select(DATE, Cash, .mean) %>%
  openxlsx::write.xlsx("ATMForecast.xlsx", asTable = TRUE, overwrite = TRUE)

We decline to make a prediction for ATM 3 because its data is simply too limited. (Should a hard requirement be presented, the mean of ATMs 1 and 2 is probably the best that can be done, but the error bands should be assumed to be massive.) We also repeat the need for further investigation into the ATM 4 outlier. The predictions above are based on a weekly seasonal component.
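
Should a number for ATM 3 nevertheless be required, a minimal sketch of that fallback follows (hypothetical; atm3Fallback is an illustrative name, and no proper prediction intervals are produced):

# Average the ATM 1 and ATM 2 point forecasts day by day; treat the
# uncertainty around this series as very large.
atm3Fallback <- tibble(
  DATE = atm1forecast$DATE,
  ATM  = 3L,
  Cash = (atm1forecast$.mean + atm2forecast$.mean) / 2
)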

Part B Residential Power

Part B consists of a simple dataset of residential power usage for January 1998 through December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

residentialLoad <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx", .name_repair = 'universal') %>%
  mutate(Date = yearmonth(YYYY.MMM)) %>%
  select(KWH, Date) %>%
  as_tsibble(index = Date)
## New names:
## * `YYYY-MMM` -> YYYY.MMM
residentialLoad %>% autoplot(vars(KWH))

# Impute the missing KWH value with mice (the yearmonth index is converted
# to a plain Date first).
imp <- mice(residentialLoad %>% mutate(Date = date(Date)))
## 
##  iter imp variable
##   1   1  KWH
##   1   2  KWH
##   1   3  KWH
##   1   4  KWH
##   1   5  KWH
##   2   1  KWH
##   2   2  KWH
##   2   3  KWH
##   2   4  KWH
##   2   5  KWH
##   3   1  KWH
##   3   2  KWH
##   3   3  KWH
##   3   4  KWH
##   3   5  KWH
##   4   1  KWH
##   4   2  KWH
##   4   3  KWH
##   4   4  KWH
##   4   5  KWH
##   5   1  KWH
##   5   2  KWH
##   5   3  KWH
##   5   4  KWH
##   5   5  KWH
# Adopt the first completed dataset's KWH column, then fit an ETS model.
residentialLoad['KWH'] <- complete(imp)['KWH']
fit <- residentialLoad %>% fill_gaps(.full = TRUE) %>% model(ETS(KWH))
components(fit) %>%  autoplot() 
## Warning: Removed 12 row(s) containing missing values (geom_path).

fit %>% forecast(h = '1 year') %>% autoplot(residentialLoad)

fit %>% forecast(h = '1 year') %>%
  select(Date, KWH, .mean) %>%
  openxlsx::write.xlsx("residentialLoadForecast.xlsx", asTable = TRUE, overwrite = TRUE)

We see one large downward outlier and some discontinuity; otherwise there is an unsurprising seasonal component with an increasing trend. We use MICE to impute the missing value but do not otherwise modify the outlier.
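
As a quick check on the imputation (a minimal sketch; the output is not reproduced here), the candidate values mice generated for the missing observation can be inspected directly:

# imp$imp$KWH holds one imputed value per imputation (five here);
# complete(imp) above used the first of these.
imp$imp$KWH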

Extra Waterflow

Part C consists of two data sets. These are simple two-column sets, but they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and a .rmd file, with the forecast in an Excel-readable file.

waterflow1 <- read_xlsx('Waterflow_Pipe1.xlsx', col_types = c("numeric", "numeric"), .name_repair = 'universal') %>%
  mutate(Date.Time = excel_numeric_to_date(Date.Time, include_time = TRUE)) %>%
  as_tsibble(index = Date.Time)
## New names:
## * `Date Time` -> Date.Time
waterflow2 <- read_xlsx('Waterflow_Pipe2.xlsx', col_types = c("numeric", "numeric"), .name_repair = 'universal') %>%
  mutate(Date.Time = excel_numeric_to_date(Date.Time, include_time = TRUE)) %>%
  as_tsibble(index = Date.Time)
## New names:
## * `Date Time` -> Date.Time
waterflowData <- bind_rows(waterflow1, waterflow2)
waterflowHourly <- waterflowData %>%
  index_by(hour = floor_date(Date.Time, unit = "hours")) %>%
  summarise(MeanFlow = mean(WaterFlow))
waterflowHourly %>% autoplot()
## Plot variable not specified, automatically selected `.vars = MeanFlow`

We see what may be a data issue: the hourly average increases markedly after November 3rd. This data-quality question should be investigated in more detail, but for now we simply exclude everything before the inflection point. We do not have enough data to know when, or whether, the series would return to the lower level, assuming the jump is real.

waterflowHourly <- waterflowHourly %>%
  as_tibble() %>%
  filter(date(hour) > date("2015-11-03")) %>%
  as_tsibble(index = hour) %>%
  fill_gaps(.full = TRUE)
waterflowHourly %>% features(MeanFlow, unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.132         0.1
waterflowHourly %>% mutate(difMeanFlow = difference(MeanFlow)) %>% features(difMeanFlow, unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0119         0.1
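
As an additional check (a quick sketch; the output is not reproduced here), feasts can estimate the number of differences required directly:

waterflowHourly %>% features(MeanFlow, unitroot_ndiffs)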
# Manually specified seasonal ARIMA with first and seasonal differencing
# and no constant.
fit <- waterflowHourly %>% model(ARIMA(MeanFlow ~ -1 + pdq(0, 1, 1) + PDQ(1, 1, 2)))
fit %>% forecast(h = "7 days") %>% autoplot(waterflowHourly)

fit %>% forecast(h = "7 days") %>%
  select(hour, MeanFlow, .mean) %>%
  openxlsx::write.xlsx("waterFlowHourlyForecast.xlsx", asTable = TRUE, overwrite = TRUE)

Trimming to the “unaffected” post-November 3rd data, the KPSS test indicates stationarity even without differencing (the reported p-value of 0.1 is the test's upper reporting bound). Honestly, depending on the use case, taking the mean within a given hour seems an odd choice: there are many situations where it would be an inaccurate, even misleading, statistic. A sketch of an alternative aggregation follows.
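
A minimal sketch of that alternative (hypothetical; TotalFlow and Readings are illustrative names, and whether a sum is the right measure depends on what the two pipes' readings represent):

# Aggregate to hourly totals rather than hourly means, keeping the number
# of readings per hour so sparse hours can be identified.
waterflowTotals <- waterflowData %>%
  index_by(hour = floor_date(Date.Time, unit = "hours")) %>%
  summarise(TotalFlow = sum(WaterFlow), Readings = n())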