Data File: ATM624Data.xlsx
Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. Explain and demonstrate your process, techniques used and not used, and your actual forecast. Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
1. Read in the data from the data file and take a look at its structure.
# Read in the ATM dataset using the rio package's import() function.
ATMData <- import('./data/ATM624Data.xlsx', col_types = c('date', 'text', 'numeric'))
# Look at the raw data.
paged_table(ATMData, options = list(rows.print = 30))
A quick inspection of the original data reveals missing values, so we will remove the incomplete rows and also reformat the data so that it is easier to work with.
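As a quick check (not part of the original workflow), we can count how many rows are incomplete before dropping them; a minimal sketch using the imported columns DATE, ATM, and Cash:
# Count incomplete rows in the imported data.
ATMData %>%
  summarise(missing_ATM = sum(is.na(ATM)), missing_Cash = sum(is.na(Cash)))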
2. Clean and reformat the data so that each ATM has its own column containing the cash withdrawal amounts for that ATM.
The first order of business is to remove rows that contain empty values.
# Drop rows containing empty values and plot the data.
ATMData <- ATMData %>% drop_na()
ggplot(ATMData, aes(DATE, Cash)) + geom_line() + facet_grid(ATM~., scales = 'free') +
labs(title = 'Cash Withdrawals From ATM1, ATM2, ATM3, and ATM4', y = 'Withdrawals in Hundreds of dollars', x = 'Date')
The above plot reveals that:
There are no withdrawals from ATM3 until around the end of April 2010. Given the limited amount of data for this ATM, we may not be able to make an accurate forecast for this machine.
ATM4 has one major outlier around March 2010.
We will now reformat the data so that each ATM has its own column containing the cash withdrawal amounts for that ATM.
# Reformat the data using the tidyr package's spread() function.
ATMData <- ATMData %>% spread(ATM, Cash)
# Summarize the data after reformatting the data.
summary(ATMData)
## DATE ATM1 ATM2 ATM3
## Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000
## Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 0.0000
## Mean :2009-10-30 Mean : 83.89 Mean : 62.58 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :96.0000
## NA's :3 NA's :2
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
##
The above summary of the reformatted data reveals that:
ATM1 and ATM2 contain empty values.
There is a major outlier for ATM4 - a cash withdrawal of 10919.762.
To further refine the data, we will remove the empty values from ATM1 and ATM2, and also replace ATM4's outlier with the median for that machine's data.
# Replace ATM4's withdrawal outlier of 10919.762 with the median for that ATM.
ATMData$ATM4[ATMData$ATM4==max(ATMData$ATM4)] <- median(ATMData$ATM4)
# Drop rows with empty values for ATM1 and ATM2.
ATMData <- ATMData %>% drop_na()
# Resummarize the data to see the effect of the changes.
summary(ATMData)
## DATE ATM1 ATM2
## Min. :2009-05-01 00:00:00 Min. : 1.00 Min. : 0.00
## 1st Qu.:2009-08-03 18:00:00 1st Qu.: 73.00 1st Qu.: 25.00
## Median :2009-11-01 12:00:00 Median : 91.00 Median : 66.00
## Mean :2009-10-31 20:28:00 Mean : 84.11 Mean : 62.37
## 3rd Qu.:2010-01-30 06:00:00 3rd Qu.:108.00 3rd Qu.: 93.25
## Max. :2010-04-30 00:00:00 Max. :180.00 Max. :147.00
## ATM3 ATM4
## Min. : 0.0000 Min. : 1.563
## 1st Qu.: 0.0000 1st Qu.: 127.719
## Median : 0.0000 Median : 403.850
## Mean : 0.7306 Mean : 447.575
## 3rd Qu.: 0.0000 3rd Qu.: 704.271
## Max. :96.0000 Max. :1712.075
The above summary confirms that the NA values have now been removed from ATM1 and ATM2, and the outlier for ATM4 has now been resolved. We will now re-plot the data to confirm that ATM4's outlier no longer exists.
# Replot the data to confirm that the outlier for ATM4 has been resolved.
ATMDataCleaned <- ATMData %>%
pivot_longer(cols = starts_with('ATM'), names_to = 'ATM', values_to = 'Cash')
ggplot(ATMDataCleaned, aes(DATE, Cash)) + geom_line() + facet_grid(ATM~., scales = 'free') +
labs(title = 'Cash Withdrawals From ATM1, ATM2, ATM3, and ATM4', y = 'Withdrawals in Hundreds of dollars', x = 'Date')
The above plot confirms that ATM4's outlier no longer exists.
We will now take a look at the reformatted data.
# Take a look at the reformatted data.
paged_table(ATMData, options = list(rows.print = 30))
Looking at the table above, we can see that the greatest amount of cash is withdrawn from ATM4, followed by ATM1 and ATM2, with close to zero withdrawals being made from ATM3 (ATM3 only has 3 cash withdrawals - 96, 82, and 85 hundred dollars).
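As a quick sanity check on that observation (a small sketch, not part of the original workflow), we can list ATM3's non-zero withdrawals directly:
# List ATM3's non-zero withdrawals and the dates on which they occurred.
ATMData %>% filter(ATM3 > 0) %>% select(DATE, ATM3)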
Now that we have insight into the data, we can convert the dataframe into a time series and work on our models in preparation for forecasting. For this project, I will be using ARIMA models to forecast the data.
The first step is to convert the entire data set into a time series and then we will focus on the ARIMA models for each individual ATM machine.
# Convert the dataset to a time series using the ts() function; frequency = 7 captures the weekly pattern in the daily data.
ATMDataTimeSeries <- ts(ATMData %>% select(-DATE), start = 1, frequency = 7)
atmOne <- ATMDataTimeSeries[,1]
ggtsdisplay(atmOne,
main = 'Cash Withdrawals for ATM1',
ylab = 'Withdrawals (Hundreds of Dollars)',
xlab = 'Week')
The above series plot for ATM1 displays weekly seasonality. The ACF and PACF plots also contain lags that fall outside the significance bounds, suggesting that the data is not white noise. Together these indicate a non-stationary series.
ARIMA models require stationarity, so we will apply a KPSS Unit Root Test to confirm the findings of the above plots that suggest the data is not stationary.
atmOne %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.5005
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The KPSS Unit Root Test results above confirm that the data is not stationary, as the test statistic is greater than the critical values. To address this, we will take a first difference of the data and run the KPSS test again to see if the series becomes stationary.
atmOne %>% diff(1) %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0085
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
As we can see from the KPSS Unit Root Test results above, a single first difference makes the data stationary.
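Since the series is daily with weekly seasonality, an alternative check (not run in the original analysis) is to apply a seasonal difference at lag 7 and repeat the KPSS test; a minimal sketch:
# Seasonally difference at lag 7 and re-run the KPSS test.
atmOne %>% diff(lag = 7) %>% ur.kpss() %>% summary()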
In order to select the most appropriate ARIMA model for forecasting, we will use the auto.arima() function.
# Run the series through the auto.arima() function and have it automatically select an appropriate ARIMA model.
lambda <- BoxCox.lambda(atmOne)
atmOneAutoArimaSelection <- auto.arima(atmOne, lambda = lambda)
# Print out the results of the selection.
summary(atmOneAutoArimaSelection)
## Series: atmOne
## ARIMA(2,0,0)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2380317
##
## Coefficients:
## ar1 ar2 sma1
## 0.1058 -0.1586 -0.5660
## s.e. 0.0525 0.0530 0.0469
##
## sigma^2 estimated as 1.723: log likelihood=-596.76
## AIC=1201.53 AICc=1201.64 BIC=1216.99
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 2.257459 26.82766 17.42781 -90.23328 110.3011 0.9545411 0.02816369
The auto.arima() function has selected an ARIMA(2,0,0)(0,1,1)[7] model. Its seasonal difference at lag 7 is consistent with our stationarity findings, so we will use this model to forecast how much cash will be taken out of ATM1 in May 2010.
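Before forecasting, we could also verify that the model's residuals resemble white noise, using the same checkresiduals() diagnostic applied later in Part C; a minimal sketch:
# Residual plots and Ljung-Box test for the fitted ARIMA(2,0,0)(0,1,1)[7] model.
checkresiduals(atmOneAutoArimaSelection)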
# Forecast cash withdrawals from ATM1 during the month of May 2010 and plot the results.
atmOneForecast <- forecast(atmOneAutoArimaSelection, 31, level = 95)
autoplot(atmOneForecast) +
labs(title = 'ARIMA(2,0,0)(0,1,1)[7] Forecast',
subtitle = 'Forecast of Cash Withdrawals From ATM1 For The Month of May 2010') +
scale_y_continuous('Withdrawals In Hundreds of Dollars', labels = scales::dollar_format(scale = 0.1, suffix = 'K')) +
xlab('Days')
The above forecast looks reasonable, reproducing the weekly pattern and general level of the series.
atmOneForecast %>% write.csv('./data/SHaslettPartAatmOneForecast.csv', row.names = FALSE)
atmTwo <- ATMDataTimeSeries[,2]
ggtsdisplay(atmTwo,
main = 'Cash Withdrawals for ATM2',
ylab = 'Withdrawals (Hundreds of Dollars)',
xlab = 'Week')
The ACF and PACF plots for ATM2 above contain lags that fall outside the significance bounds, suggesting that the data is not white noise and that the series is non-stationary.
We will apply a KPSS Unit Root Test to confirm that the data is not stationary.
atmTwo %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 2.0498
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The KPSS Unit Root Test results above confirm that the data is not stationary, as the test statistic is greater than the critical values. We will take a first difference of the data and run the KPSS test again to see if the series becomes stationary.
atmTwo %>% diff(1) %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0148
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The KPSS Unit Root Test results above confirm that a single first difference makes the data stationary.
In order to select the most appropriate ARIMA model for forecasting, we will use the auto.arima() function.
# Run the series through the auto.arima() function and have it automatically select an appropriate ARIMA model.
lambda <- BoxCox.lambda(atmTwo)
atmTwoAutoArimaSelection <- auto.arima(atmTwo, lambda = lambda)
# Print out the results of the selection.
summary(atmTwoAutoArimaSelection)
## Series: atmTwo
## ARIMA(3,0,4)(0,1,1)[7]
## Box Cox transformation: lambda= 0.6382776
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 sma1
## -0.7637 0.6012 0.6620 0.8389 -0.7285 -0.9425 -0.0163 -0.4198
## s.e. 0.0866 0.0782 0.0594 0.1029 0.0859 0.0739 0.0811 0.0549
##
## sigma^2 estimated as 39.58: log likelihood=-1148.69
## AIC=2315.37 AICc=2315.9 BIC=2350.17
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.1344195 26.11895 18.5345 -Inf Inf 0.8548052 -0.02914399
The auto.arima() function has selected an ARIMA(3,0,4)(0,1,1)[7] model, so we will use this to forecast how much cash will be taken out of ATM2 in May 2010.
# Forecast cash withdrawals from ATM2 during the month of May 2010 and plot the results.
atmTwoForecast <- forecast(atmTwoAutoArimaSelection, 31, level = 95)
autoplot(atmTwoForecast) +
labs(title = 'ARIMA(3,0,4)(0,1,1)[7] Forecast',
subtitle = 'Forecast of Cash Withdrawals From ATM2 For The Month of May 2010') +
scale_y_continuous('Withdrawals In Hundreds of Dollars', labels = scales::dollar_format(scale = 0.1, suffix = 'K')) +
xlab('Days')
atmTwoForecast %>% write.csv('./data/SHaslettPartAatmTwoForecast.csv', row.names = FALSE)
atmThree <- ATMDataTimeSeries[,3]
ggtsdisplay(atmThree,
main = 'Cash Withdrawals for ATM3',
ylab = 'Withdrawals (Hundreds of Dollars)',
xlab = 'Week')
ATM3 has only three non-zero observations at the very end of the series, which is not enough data to fit a reliable model, so we will use a random walk with drift to make a forecast.
# Forecast cash withdrawals from ATM3 during the month of May 2010 using a random walk with drift and plot the results.
atmThreeForecast <- rwf(atmThree, h = 31, drift=TRUE)
autoplot(atmThreeForecast) +
labs(title = 'Random Walk With Drift Forecast For ATM3') +
scale_y_continuous('Withdrawals In Hundreds of Dollars', labels = scales::dollar_format(scale = 0.1, suffix = 'K')) +
xlab('Days')
atmThreeForecast %>% write.csv('./data/SHaslettPartAatmThreeForecast.csv', row.names = FALSE)
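Because ATM3 has only three non-zero observations at the very end of the series, a simpler alternative (not used here) would be a naive forecast that carries the last observed withdrawal forward; a minimal sketch:
# Naive forecast: repeat ATM3's last observed withdrawal for each day of May 2010.
atmThreeNaiveForecast <- naive(atmThree, h = 31)
autoplot(atmThreeNaiveForecast)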
atmFour <- ATMDataTimeSeries[,4]
ggtsdisplay(atmFour,
main = 'Cash Withdrawals for ATM4',
ylab = 'Withdrawals (Hundreds of Dollars)',
xlab = 'Week')
The ACF and PACF plots for ATM4 above show that most of the lags are within the boundaries, so the data is most likely stationary. To confirm this, we will apply a KPSS Unit Root Test.
atmFour %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.1506
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The KPSS Unit Root Test results above confirm that the data for ATM4 is indeed stationary.
In order to select the most appropriate ARIMA model for forecasting, we will use the auto.arima() function.
# Run the series through the auto.arima() function and have it automatically select an appropriate ARIMA model.
lambda <- BoxCox.lambda(atmFour)
atmFourAutoArimaSelection <- auto.arima(atmFour, lambda = lambda)
# Print out the results of the selection.
summary(atmFourAutoArimaSelection)
## Series: atmFour
## ARIMA(1,0,1)(2,0,1)[7] with non-zero mean
## Box Cox transformation: lambda= 0.4158853
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1 mean
## -0.5873 0.6667 0.8962 0.0086 -0.7347 24.6601
## s.e. 0.2620 0.2402 0.1042 0.0702 0.0902 1.4837
##
## sigma^2 estimated as 118.7: log likelihood=-1368.66
## AIC=2751.31 AICc=2751.63 BIC=2778.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 89.37816 350.3995 271.4736 -360.4888 405.2597 0.7860176 0.01588157
The auto.arima() function has selected an ARIMA(1,0,1)(2,0,1)[7] model with a non-zero mean, so we will use this to forecast how much cash will be taken out of ATM4 in May 2010.
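As a sanity check (not part of the original analysis), the in-sample accuracy of the selected model could be compared against a seasonal naive benchmark; a minimal sketch:
# Compare training-set accuracy of the ARIMA fit against a seasonal naive benchmark.
accuracy(atmFourAutoArimaSelection)
accuracy(snaive(atmFour))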
# Forecast cash withdrawals from ATM4 during the month of May 2010 and plot the results.
atmFourForecast <- forecast(atmFourAutoArimaSelection, 31, level = 95)
autoplot(atmFourForecast) +
labs(title = 'ARIMA(1,0,1)(2,0,1)[7] Forecast',
subtitle = 'Forecast of Cash Withdrawals From ATM4 For The Month of May 2010') +
scale_y_continuous('Withdrawals In Hundreds of Dollars', labels = scales::dollar_format(scale = 0.1, suffix = 'K')) +
xlab('Days')
atmFourForecast %>% write.csv('./data/SHaslettPartAatmFourForecast.csv', row.names = FALSE)
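The individual CSV exports above are already Excel readable, but the four sets of point forecasts could also be combined into a single submission file. A minimal sketch, assuming the 31 forecast steps correspond to May 1 through May 31, 2010 (the combined file name is hypothetical):
# Combine the May 2010 point forecasts for all four ATMs into one Excel-readable CSV.
combinedForecasts <- data.frame(
  DATE = seq(as.Date('2010-05-01'), as.Date('2010-05-31'), by = 'day'),
  ATM1 = as.numeric(atmOneForecast$mean),
  ATM2 = as.numeric(atmTwoForecast$mean),
  ATM3 = as.numeric(atmThreeForecast$mean),
  ATM4 = as.numeric(atmFourForecast$mean))
# Hypothetical output file name for the combined forecast.
write.csv(combinedForecasts, './data/SHaslettPartA_CombinedForecasts.csv', row.names = FALSE)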
Data File: ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple data set of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
1. Read in the data from the data file and take a look at its structure.
residentialPowerUsage <- import('./data/ResidentialCustomerForecastLoad-624.xlsx')
paged_table(residentialPowerUsage, options = list(rows.print = 30))
# Summarize the data.
summary(residentialPowerUsage)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
2. Data Transformation
Looking at the summary of the raw data above, we can see that the KWH column contains one empty value and one major outlier: an abnormally low reading of 770,523 KWH. It also looks like the date format will be problematic when plotting and forecasting the data. Before taking care of the empty value and the outlier, we will reformat the date column so we can plot the data and inspect it visually.
# Format the date column into a compatible format.
residentialPowerUsage$`YYYY-MMM` <- paste0(residentialPowerUsage$`YYYY-MMM`, '-01')
residentialPowerUsage$date <- ymd(residentialPowerUsage$`YYYY-MMM`)
# Plot the data.
ggplot(residentialPowerUsage, aes(date, KWH)) + geom_line() +
labs(title = 'Residential Power Usage', y = 'Power Consumption In kWh', x = 'Time')
The above plot confirms the low outlier visually.
To take care of both the missing value and the outlier, I will impute both values with the mean of the data for the month in which they occur.
# The missing value occurs in September 2008, so impute the
# missing value with the mean of all September readings.
residentialPowerUsage$Month <- month(residentialPowerUsage$date)
residentialPowerUsage$KWH[is.na(residentialPowerUsage$KWH)] <- mean(residentialPowerUsage$KWH[residentialPowerUsage$Month == 9], na.rm = TRUE)
# The outlier occurs in July 2010, so impute the outlier with the mean of all July readings.
residentialPowerUsage$KWH[residentialPowerUsage$KWH == min(residentialPowerUsage$KWH)] <- mean(residentialPowerUsage$KWH[residentialPowerUsage$Month == 7], na.rm = TRUE)
Now that we have imputed the data to take care of both the outlier and the missing value, we will summarize and re-plot the data to confirm that the issues are indeed resolved.
# Summarize and plot the data to confirm that both the missing value and the outlier have been taken care of.
summary(residentialPowerUsage)
## CaseSequence YYYY-MMM KWH date
## Min. :733.0 Length:192 Min. : 4313019 Min. :1998-01-01
## 1st Qu.:780.8 Class :character 1st Qu.: 5443502 1st Qu.:2001-12-24
## Median :828.5 Mode :character Median : 6351262 Median :2005-12-16
## Mean :828.5 Mean : 6543304 Mean :2005-12-15
## 3rd Qu.:876.2 3rd Qu.: 7649733 3rd Qu.:2009-12-08
## Max. :924.0 Max. :10655730 Max. :2013-12-01
## Month
## Min. : 1.00
## 1st Qu.: 3.75
## Median : 6.50
## Mean : 6.50
## 3rd Qu.: 9.25
## Max. :12.00
# Plot the data after imputation.
ggplot(residentialPowerUsage, aes(date, KWH)) + geom_line() +
labs(title = 'Residential Power Usage After Imputation', y = 'Power Consumption In kWh', x = 'Time')
The above summary and plot confirm that the imputation has resolved both the missing value and the outlier.
Now that we have cleansed the data, we can convert the data into a time series and work on our ARIMA model in preparation for creating a 2014 monthly forecast.
1. Convert the data into a time series.
# Convert the data to a time series.
residentialPowerUsageTimeSeries <- ts(residentialPowerUsage[,'KWH'], start = c(1998, 1), frequency = 12)
2. Analyze the time series.
# Plot the time series data.
autoplot(residentialPowerUsageTimeSeries) +
labs(title = 'Monthly Residential Power Usage', subtitle = 'January 1998 Through December 2013') +
ylab('Power Consumption In kWh') +
xlab('Month')
The above plot reveals that the time series is seasonal, with peaks roughly every six months. This is probably because more electricity is consumed during the winter and summer months. The series also displays an upward trend.
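To look at the seasonal pattern more directly, we could also draw a season plot with the forecast package (not included in the original analysis); a minimal sketch:
# Overlay each year's monthly usage to highlight the winter and summer peaks.
ggseasonplot(residentialPowerUsageTimeSeries, year.labels = TRUE) +
  ylab('Power Consumption In kWh') +
  ggtitle('Seasonal Plot: Monthly Residential Power Usage')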
In order to select the most appropriate ARIMA model for forecasting, we will use the auto.arima() function.
# Run the series through the auto.arima() function and have it automatically select an appropriate ARIMA model.
lambda <- BoxCox.lambda(residentialPowerUsageTimeSeries)
residentialPowerConsumptionAutoArimaSelection <- auto.arima(residentialPowerUsageTimeSeries, lambda = lambda)
# Print out the results of the selection.
summary(residentialPowerConsumptionAutoArimaSelection)
## Series: residentialPowerUsageTimeSeries
## ARIMA(0,0,1)(2,1,0)[12] with drift
## Box Cox transformation: lambda= -0.2018638
##
## Coefficients:
## ma1 sar1 sar2 drift
## 0.2444 -0.7265 -0.3599 0.0001
## s.e. 0.0828 0.0750 0.0766 0.0001
##
## sigma^2 estimated as 0.00001647: log likelihood=742.97
## AIC=-1475.95 AICc=-1475.6 BIC=-1459.98
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 48157 626456.3 477342.9 0.1542706 7.247545 0.7798858 0.09981818
The auto.arima() function has selected an ARIMA(0,0,1)(2,1,0)[12] with drift model, so we will use this to create our 2014 monthly power consumption forecast.
# Forecast monthly power consumption for 2014 (12 months) and plot the results.
residentialPowerConsumptionForecast <- forecast(residentialPowerConsumptionAutoArimaSelection, 12, level = 95)
autoplot(residentialPowerConsumptionForecast) +
labs(title = 'ARIMA(0,0,1)(2,1,0)[12] With Drift Forecast',
subtitle = '2014 Monthly Power Consumption Forecast') +
ylab('Power Consumption In kWh') +
xlab('Month')
The above forecast captures both the seasonal pattern and the upward trend of the series.
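Before exporting, the 2014 point forecasts could be paired with month labels for readability; a minimal sketch (the month sequence is constructed by hand):
# Tabulate the 2014 monthly point forecasts alongside month labels.
data.frame(Month = format(seq(as.Date('2014-01-01'), by = 'month', length.out = 12), '%Y-%b'),
           KWH = as.numeric(residentialPowerConsumptionForecast$mean)[1:12])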
residentialPowerConsumptionForecast %>% write.csv('./data/SHaslettPartBresidentialPowerConsumptionForecast.csv', row.names = FALSE)
Data Files: Waterflow_Pipe1.xlsx, Waterflow_Pipe2.xlsx
Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour. Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd file, along with the forecast in an Excel-readable file.
1. Read in the data from the data files and take a look at their structure.
# Import the 2 data files.
waterFlowPipeOne <- import('./data/Waterflow_Pipe1.xlsx', col_types = c('date', 'numeric'))
waterFlowPipeTwo <- import('./data/Waterflow_Pipe2.xlsx', col_types = c('date', 'numeric'))
Now that we have imported the data, let's look at the structure of both data sets.
Water Flow Pipe 1 Data Structure
# Look at the data structure for pipe 1.
paged_table(waterFlowPipeOne, options = list(rows.print = 15))
Water Flow Pipe 2 Data Structure
# Look at the data structure for pipe 2.
paged_table(waterFlowPipeTwo, options = list(rows.print = 15))
Looking at the structure of the 2 data sets above, we can see that both files are measuring water flow. However, they record the data at different time intervals: the Water Flow Pipe 1 data set records at irregular intervals, whereas the Water Flow Pipe 2 data set records every hour. The fact that Water Flow Pipe 2 records at regular intervals makes our lives easier, as we can aggregate Pipe 1's readings to hourly means and then join the two data sets.
2. Prepare the waterFlowPipeOne data set for aggregation.
# Rename the columns (removing the space from 'Date Time') so the two data sets can be joined easily.
colnames(waterFlowPipeOne) <- c('DateTime', 'WaterFlow')
colnames(waterFlowPipeTwo) <- c('DateTime', 'WaterFlow')
# Prep the waterFlowPipeOne data set for aggregation and make
# the timestamp conform to that of waterFlowPipeTwo.
waterFlowPipeOne <- waterFlowPipeOne %>%
mutate(Date = as.Date(DateTime), Time = hour(DateTime) + 1) %>%
group_by(Date, Time) %>%
summarise(WaterFlow = mean(WaterFlow)) %>%
ungroup() %>%
mutate(DateTime = ymd_h(paste(Date, Time))) %>%
select(DateTime, WaterFlow)
# Re-examine the dataset after transformation to ensure that the timestamps match those of waterFlowPipeTwo.
paged_table(waterFlowPipeOne, options = list(rows.print = 20))
After transforming the waterFlowPipeOne data set, we can see from the above table that the timestamp now matches that of waterFlowPipeTwo. We can now aggregate the 2 data sets.
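A broadly equivalent, more compact way to do this hourly bucketing (an alternative, not used above) is lubridate's ceiling_date(), applied to the raw pipe 1 readings; a minimal sketch:
# Re-import the raw pipe 1 data and round each reading up to the next hour boundary,
# then average the readings within each hourly bucket.
waterFlowPipeOneRaw <- import('./data/Waterflow_Pipe1.xlsx', col_types = c('date', 'numeric'))
colnames(waterFlowPipeOneRaw) <- c('DateTime', 'WaterFlow')
waterFlowPipeOneRaw %>%
  mutate(DateTime = ceiling_date(DateTime, unit = 'hour')) %>%
  group_by(DateTime) %>%
  summarise(WaterFlow = mean(WaterFlow)) %>%
  ungroup()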
3. Aggregate the two data sets.
# Combine the waterFlowPipeOne data set with waterFlowPipeTwo.
waterFlowPipes <- full_join(waterFlowPipeOne, waterFlowPipeTwo, by = 'DateTime', suffix = c('_PipeOne', '_PipeTwo')) %>%
mutate(WaterFlow_PipeOne = ifelse(is.na(WaterFlow_PipeOne), 0, WaterFlow_PipeOne)) %>%
mutate(WaterFlow_PipeTwo = ifelse(is.na(WaterFlow_PipeTwo), 0, WaterFlow_PipeTwo)) %>%
mutate(WaterFlow = WaterFlow_PipeOne + WaterFlow_PipeTwo) %>%
select(DateTime, WaterFlow)
# Re-examine the dataset after aggregation.
paged_table(waterFlowPipes, options = list(rows.print = 30))
Now that the data has been aggregated, we will convert it to a time series and check for stationarity.
# Convert the aggregated data to a time series and plot the results.
waterFlowTimeSeries <- ts(waterFlowPipes$WaterFlow)
autoplot(waterFlowTimeSeries) +
labs(title = 'Water Flow Time Series') +
ylab('Waterflow Measurement') +
xlab('Time')
Looking at the variance in the Water Flow time series above, it is unlikely that the data is stationary. To confirm this, we will perform a KPSS Unit root test.
# Perform a KPSS Unit root test on the time series to check for stationarity.
waterFlowTimeSeries %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 5.5327
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The results of the KPSS test confirm that the data is not stationary, as the test statistic is greater than the critical values. To address this, we will difference the data once at lag 1 and run the KPSS test again to see if it becomes stationary.
# Difference the data and retest.
waterFlowTimeSeries %>% diff() %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0092
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The above test results confirm that the data has become stationary after differencing. To confirm this visually, we will plot the differenced time series.
# Replot the time series after differencing to check for stationarity.
waterFlowTimeSeriesDifferenced <- waterFlowTimeSeries %>% diff()
autoplot(waterFlowTimeSeriesDifferenced) +
labs(title = 'Stationary Water Flow Time Series After Differencing')
The above plot visually confirms that the data is stationary after differencing.
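As an additional quick check (not part of the original analysis), the forecast package's ndiffs() function estimates how many differences are required for stationarity; a minimal sketch:
# Estimate the number of first differences needed to make the series stationary.
ndiffs(waterFlowTimeSeries)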
Now that we have confirmed that the data is stationary after differencing, we will use the auto.arima() function to select a non-seasonal ARIMA model and provide a week-forward forecast of water flow measurements.
# Use the auto.arima() function to select a non-seasonal ARIMA model for forecasting.
waterFlowAutoArima <- auto.arima(waterFlowTimeSeries, seasonal = FALSE)
waterFlowAutoArima
## Series: waterFlowTimeSeries
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1
## -0.0512 -0.9493
## s.e. 0.0334 0.0115
##
## sigma^2 estimated as 268.4: log likelihood=-4211.09
## AIC=8428.18 AICc=8428.21 BIC=8442.9
The auto.arima() function has selected an ARIMA(1,1,1) model, so we will use this to forecast the water flow measurements.
# Forecast the data using the auto selected ARIMA model.
waterFlowForecast <- forecast(waterFlowAutoArima, h = 7 * 24)
# Check the residuals.
checkresiduals(waterFlowForecast)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 2.1151, df = 8, p-value = 0.9773
##
## Model df: 2. Total lags used: 10
# Plot the forecast.
autoplot(waterFlowForecast) +
labs(title = 'ARIMA(1,1,1) Water Flow Forecast',
subtitle = 'Week Forward Forecast Of Water Flow Measurements',
y = 'Water Flow',
x = 'Day')
Looking at the above results, the ARIMA(1,1,1) model seems to be a good fit for this forecast. Most of the autocorrelations in the ACF plot are within the boundaries, and the p-value from the Ljung-Box test is well above the significance level. The forecast plot is also satisfactory, so we will move on to exporting the forecast as a CSV file.
waterFlowForecast %>% write.csv('./data/SHaslettPartCWaterFlow.csv', row.names = FALSE)