Part A – ATM Forecast

Data File: ATM624Data.xlsx

Objectives

Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. Explain and demonstrate your process, techniques used and not used, and your actual forecast. Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel readable file.

 

Data Analysis And Transformation

1. Read in the data from the data file and take a look at its structure.
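Before reading the data, we load the packages used throughout this report. This setup chunk is not shown in the original output; the library list is inferred from the functions called below.

# Load the packages used throughout this report (assumed installed).
library(rio)        # import() for reading Excel files
library(rmarkdown)  # paged_table() for browsable tables
library(tidyverse)  # dplyr, tidyr, and ggplot2 for wrangling and plotting
library(lubridate)  # date helpers such as ymd(), month(), and hour()
library(forecast)   # ggtsdisplay(), auto.arima(), forecast(), rwf()
library(urca)       # ur.kpss() unit root test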

# Read in the ATM dataset using the rio package's import() function.
ATMData <- import('./data/ATM624Data.xlsx', col_types = c('date', 'text', 'numeric'))

# Look at the raw data.
paged_table(ATMData, options = list(rows.print = 30))

A quick inspection of the raw data reveals missing values, so we will remove the affected rows and reformat the data to make it easier to work with.

2. Clean and reformat the data so that each ATM has its own column containing the cash withdrawal amounts for that ATM.

The first order of business is to remove rows that contain empty values.

# Drop rows containing empty values and plot the data.
ATMData <- ATMData %>% drop_na()

ggplot(ATMData, aes(DATE, Cash)) + geom_line() + facet_grid(ATM~., scales = 'free') +
  labs(title = 'Cash Withdrawals From ATM1, ATM2, ATM3, and ATM4', y = 'Withdrawals in Hundreds of dollars', x = 'Date')

The above plot reveals that:

  • There are no withdrawals from ATM3 until around the end of April 2010. Given the limited amount of data for this ATM, we may not be able to make an accurate forecast for this machine.

  • ATM4 has one major outlier around March 2010.

We will now reformat the data so that each ATM has its own column containing the cash withdrawal amounts for that ATM.

# Reformat the data using the tidyr package's spread() function.
ATMData <- ATMData %>% spread(ATM, Cash)

# Summarize the reformatted data.
summary(ATMData)
##       DATE                 ATM1             ATM2             ATM3        
##  Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000  
##  1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: 0.0000  
##  Median :2009-10-30   Median : 91.00   Median : 67.00   Median : 0.0000  
##  Mean   :2009-10-30   Mean   : 83.89   Mean   : 62.58   Mean   : 0.7206  
##  3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000  
##  Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :96.0000  
##                       NA's   :3        NA's   :2                         
##       ATM4          
##  Min.   :    1.563  
##  1st Qu.:  124.334  
##  Median :  403.839  
##  Mean   :  474.043  
##  3rd Qu.:  704.507  
##  Max.   :10919.762  
## 

The above summary of the reformatted data reveals that:

  • ATM1 and ATM2 contain empty values.

  • There is a major outlier for ATM4 - a cash withdrawal of 10919.762.

To further refine the data, we will remove the empty values from ATM1 and ATM2, and also replace ATM4's outlier with the median for that machine's data.

# Replace ATM4's withdrawal outlier of 10919.762 with the median for that ATM.
ATMData$ATM4[ATMData$ATM4==max(ATMData$ATM4)] <- median(ATMData$ATM4)

# Drop rows with empty values for ATM1 and ATM2.
ATMData <- ATMData %>% drop_na()

# Resummarize the data to see the effect of the changes.
summary(ATMData)
##       DATE                          ATM1             ATM2       
##  Min.   :2009-05-01 00:00:00   Min.   :  1.00   Min.   :  0.00  
##  1st Qu.:2009-08-03 18:00:00   1st Qu.: 73.00   1st Qu.: 25.00  
##  Median :2009-11-01 12:00:00   Median : 91.00   Median : 66.00  
##  Mean   :2009-10-31 20:28:00   Mean   : 84.11   Mean   : 62.37  
##  3rd Qu.:2010-01-30 06:00:00   3rd Qu.:108.00   3rd Qu.: 93.25  
##  Max.   :2010-04-30 00:00:00   Max.   :180.00   Max.   :147.00  
##       ATM3              ATM4         
##  Min.   : 0.0000   Min.   :   1.563  
##  1st Qu.: 0.0000   1st Qu.: 127.719  
##  Median : 0.0000   Median : 403.850  
##  Mean   : 0.7306   Mean   : 447.575  
##  3rd Qu.: 0.0000   3rd Qu.: 704.271  
##  Max.   :96.0000   Max.   :1712.075

The above summary confirms that the NA values have been removed from ATM1 and ATM2 and that ATM4's outlier has been replaced. We will now re-plot the data to verify this visually.

# Replot the data to confirm that the outlier for ATM4 has been resolved. 
ATMDataCleaned <- ATMData %>% 
  pivot_longer(cols = starts_with('ATM'), names_to = 'ATM', values_to = 'Cash')

ggplot(ATMDataCleaned, aes(DATE, Cash)) + geom_line() + facet_grid(ATM~., scales = 'free') +
  labs(title = 'Cash Withdrawals From ATM1, ATM2, ATM3, and ATM4', y = 'Withdrawals in Hundreds of dollars', x = 'Date')

The above plot confirms that ATM4's outlier no longer exists.

We will now take a look at the reformatted data.

# Take a look at the reformatted data.
paged_table(ATMData, options = list(rows.print = 30))

Looking at the table above, we can see that the greatest amount of cash is withdrawn from ATM4, followed by ATM1 and ATM2, with close to zero withdrawals being made from ATM3 (ATM3 has only three cash withdrawals: 96, 82, and 85 hundred dollars), which the quick check below confirms.
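Since that claim is easy to verify directly, here is a quick, illustrative check:

# Confirm that ATM3 has only three non-zero withdrawals.
ATMData %>% filter(ATM3 > 0) %>% select(DATE, ATM3)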

Timeseries and ARIMA Modeling

Now that we have insight into the data, we can convert the dataframe into a time series and work on our models in preparation for forecasting. For this project, I will be using ARIMA models to forecast the data.

The first step is to convert the entire data set into a time series and then we will focus on the ARIMA models for each individual ATM machine.

# Convert the dataset to a timeseries using the ts() function.
ATMDataTimeSeries <- ts(ATMData %>% select(-DATE), start = 1, frequency = 7)

 

ATM 1
atmOne <- ATMDataTimeSeries[,1]
ggtsdisplay(atmOne,
            main = 'Cash Withdrawals for ATM1',
            ylab = 'Withdrawals in Hundreds of Dollars',
            xlab = 'Week')

The series plot for ATM1 above displays clear weekly seasonality, and the ACF and PACF plots contain significant spikes well outside the confidence bounds, so the data is not white noise. Together these indicate a non-stationary series.
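To make the weekly pattern easier to see, we can also draw a seasonal subseries plot; a supplementary sketch using the forecast package:

# Supplementary view: group withdrawals by day of the week to expose
# the weekly seasonality.
ggsubseriesplot(atmOne) +
  labs(title = 'Seasonal Subseries Plot For ATM1',
       y = 'Withdrawals in Hundreds of Dollars', x = 'Day of Week')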

ARIMA models require stationarity, so we will apply a KPSS Unit Root Test to confirm the findings of the above plots that suggest the data is not stationary.

atmOne %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.5005 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS Unit Root Test results above confirm that the data is not stationary, as the test statistic (0.5005) exceeds the 5% critical value. To address this, we will take a first difference of the series and run the KPSS test again to see whether it becomes stationary.

atmOne %>% diff(1) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0085 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

As we can see from the KPSS Unit Root Test results above, a single first difference makes the data stationary: the test statistic drops well below all of the critical values.
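As a cross-check, the forecast package can also estimate the required differencing orders directly; a minimal sketch:

# Cross-check: estimate the number of first and seasonal differences
# required to make the series stationary.
ndiffs(atmOne)
nsdiffs(atmOne)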

ATM1 ARIMA Modeling

In order to select the most appropriate ARIMA model for forecasting, we will use the auto.arima() function to suggest the best model to use.

# Run the series through the auto.arima() function and have it automatically select an appropriate ARIMA model. 
lambda <- BoxCox.lambda(atmOne)
atmOneAutoArimaSelection <- auto.arima(atmOne, lambda = lambda)

# Print out the results of the selection. 
summary(atmOneAutoArimaSelection)
## Series: atmOne 
## ARIMA(2,0,0)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2380317 
## 
## Coefficients:
##          ar1      ar2     sma1
##       0.1058  -0.1586  -0.5660
## s.e.  0.0525   0.0530   0.0469
## 
## sigma^2 estimated as 1.723:  log likelihood=-596.76
## AIC=1201.53   AICc=1201.64   BIC=1216.99
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 2.257459 26.82766 17.42781 -90.23328 110.3011 0.9545411 0.02816369

The auto.arima() function has selected an ARIMA(2,0,0)(0,1,1)[7] model. Its single seasonal difference is consistent with our KPSS finding that one round of differencing is sufficient, so we will use this model to forecast how much cash will be taken out of ATM1 in May 2010.

ATM1 May 2010 Cash Withdrawal Forecast.
# Forecast cash withdrawals from ATM1 during the month of May 2010 and plot the results.
atmOneForecast <- forecast(atmOneAutoArimaSelection, 31, level = 95)
autoplot(atmOneForecast) +
  labs(title = 'ARIMA(2,0,0)(0,1,1)[7] Forecast',
       subtitle = 'Forecast of Cash Withdrawals From ATM1 For The Month of May 2010') +
  scale_y_continuous('Withdrawals', labels = scales::dollar_format(scale = 0.1, suffix = 'K')) +
  xlab('Days') 

The forecast looks reasonable: it reproduces the weekly seasonal pattern of the series.
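Before relying on this forecast, it is worth confirming that the model's residuals resemble white noise, as we do for Part C below; a minimal sketch:

# Residual diagnostics: Ljung-Box test plus residual and ACF plots.
checkresiduals(atmOneAutoArimaSelection)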

Export The Forecast To A CSV File
atmOneForecast %>% write.csv('./data/SHaslettPartAatmOneForecast.csv', row.names = FALSE)

 

ATM 2

atmTwo <- ATMDataTimeSeries[,2]
ggtsdisplay(atmTwo,
            main = 'Cash Withdrawals for ATM2',
            ylab = 'Withdrawals in Hundreds of Dollars',
            xlab = 'Week')

The ACF and PACF plots for ATM2 above contain significant spikes outside the confidence bounds, so the data is not white noise. This suggests a non-stationary series.

We will apply a KPSS Unit Root Test to confirm that the data is not stationary.

atmTwo %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 2.0498 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS Unit Root Test results above confirm that the data is not stationary, as the test statistic (2.0498) exceeds the critical values at every significance level. We will take a first difference of the series and run the KPSS test again to see whether it becomes stationary.

atmTwo %>% diff(1) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0148 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS Unit Root Test results above confirm that a single first difference makes the data stationary.

ATM2 ARIMA Modeling

In order to select the most appropriate ARIMA model for forecasting, we will use the auto.arima() function to suggest the best model to use.

# Run the series through the auto.arima() function and have it automatically select an appropriate ARIMA model. 
lambda <- BoxCox.lambda(atmTwo)
atmTwoAutoArimaSelection <- auto.arima(atmTwo, lambda = lambda)

# Print out the results of the selection. 
summary(atmTwoAutoArimaSelection)
## Series: atmTwo 
## ARIMA(3,0,4)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.6382776 
## 
## Coefficients:
##           ar1     ar2     ar3     ma1      ma2      ma3      ma4     sma1
##       -0.7637  0.6012  0.6620  0.8389  -0.7285  -0.9425  -0.0163  -0.4198
## s.e.   0.0866  0.0782  0.0594  0.1029   0.0859   0.0739   0.0811   0.0549
## 
## sigma^2 estimated as 39.58:  log likelihood=-1148.69
## AIC=2315.37   AICc=2315.9   BIC=2350.17
## 
## Training set error measures:
##                      ME     RMSE     MAE  MPE MAPE      MASE        ACF1
## Training set -0.1344195 26.11895 18.5345 -Inf  Inf 0.8548052 -0.02914399

The auto.arima() function has selected an ARIMA(3,0,4)(0,1,1)[7] model, so we will use this to forecast how much cash will be taken out of ATM2 in May 2010. (The infinite MPE and MAPE values above are an artifact of zero-withdrawal days in the training data, not a sign of model failure.)

ATM2 May 2010 Cash Withdrawal Forecast.
# Forecast cash withdrawals from ATM2 during the month of May 2010 and plot the results.
atmTwoForecast <- forecast(atmTwoAutoArimaSelection, 31, level = 95)
autoplot(atmTwoForecast) +
  labs(title = 'ARIMA(3,0,4)(0,1,1)[7] Forecast',
       subtitle = 'Forecast of Cash Withdrawals From ATM2 For The Month of May 2010') +
  scale_y_continuous('Withdrawals', labels = scales::dollar_format(scale = 0.1, suffix = 'K')) +
  xlab('Days') 

Export The Forecast To A CSV File
atmTwoForecast %>% write.csv('./data/SHaslettPartAatmTwoForecast.csv', row.names = FALSE)

 

ATM 3
atmThree <- ATMDataTimeSeries[,3]
ggtsdisplay(atmThree,
            main = 'Cash Withdrawals for ATM3',
            ylab = 'Withdrawals in Hundreds of Dollars',
            xlab = 'Week')

ATM3 has only three non-zero observations, which is far too little data to fit a reliable ARIMA model, so we will use a random walk with drift to produce the forecast.

ATM3 May 2010 Cash Withdrawal Forecast.
# Forecast cash withdrawals from ATM3 during the month of May 2010 using a random walk with drift and plot the results.
atmThreeForecast <- rwf(atmThree, h = 31, drift=TRUE)
autoplot(atmThreeForecast) +
  labs(title = 'Random Walk With Drift Forecast For ATM3') +
  scale_y_continuous('Withdrawals', labels = scales::dollar_format(scale = 0.1, suffix = 'K')) +
  xlab('Days') 
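As a sanity check on the drift term, which is pulled down by the long run of zeros, one could instead forecast from only the three non-zero observations; a hedged sketch (the assumption that ATM3 keeps dispensing cash at its recent level is ours, not the data's):

# Hypothetical alternative: average only the non-zero withdrawals,
# assuming ATM3 continues operating at its recent level.
atmThreeActive <- ts(atmThree[atmThree > 0])
meanf(atmThreeActive, h = 31, level = 95)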

Export The Forecast To A CSV File
atmThreeForecast %>% write.csv('./data/SHaslettPartAatmThreeForecast.csv', row.names = FALSE)

 

ATM 4
atmFour <- ATMDataTimeSeries[,4]
ggtsdisplay(atmFour,
            main = 'Cash Withdrawals for ATM4',
            ylab = 'Withdrawals in Hundreds of Dollars',
            xlab = 'Week')

The ACF and PACF plots for ATM4 above show that most of the autocorrelations fall within the confidence bounds, so the data is most likely stationary. To confirm this, we will apply a KPSS Unit Root Test.

atmFour %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.1506 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS Unit Root Test results above confirm that the data for ATM4 is indeed stationary: the test statistic (0.1506) falls below even the 10% critical value.

ATM4 ARIMA Modeling

In order to select the most appropriate ARIMA model for forecasting, we will use the auto.arima() function to suggest the best model to use.

# Run the series through the auto.arima() function and have it automatically select an appropriate ARIMA model. 
lambda <- BoxCox.lambda(atmFour)
atmFourAutoArimaSelection <- auto.arima(atmFour, lambda = lambda)

# Print out the results of the selection. 
summary(atmFourAutoArimaSelection)
## Series: atmFour 
## ARIMA(1,0,1)(2,0,1)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.4158853 
## 
## Coefficients:
##           ar1     ma1    sar1    sar2     sma1     mean
##       -0.5873  0.6667  0.8962  0.0086  -0.7347  24.6601
## s.e.   0.2620  0.2402  0.1042  0.0702   0.0902   1.4837
## 
## sigma^2 estimated as 118.7:  log likelihood=-1368.66
## AIC=2751.31   AICc=2751.63   BIC=2778.52
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 89.37816 350.3995 271.4736 -360.4888 405.2597 0.7860176 0.01588157

The auto.arima() function has selected an ARIMA(1,0,1)(2,0,1)[7] model with a non-zero mean, so we will use this to forecast how much cash will be taken out of ATM4 in May 2010.

ATM4 May 2010 Cash Withdrawal Forecast.
# Forecast cash withdrawals from ATM4 during the month of May 2010 and plot the results.
atmFourForecast <- forecast(atmFourAutoArimaSelection, 31, level = 95)
autoplot(atmFourForecast) +
  labs(title = 'ARIMA(1,0,1)(2,0,1)[7] Forecast',
       subtitle = 'Forecast of Cash Withdrawals From ATM4 For The Month of May 2010') +
  scale_y_continuous('Withdrawals', labels = scales::dollar_format(scale = 0.1, suffix = 'K')) +
  xlab('Days') 

Export The Forecast To A CSV File
atmFourForecast %>% write.csv('./data/SHaslettPartAatmFourForecast.csv', row.names = FALSE)
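The four point forecasts can also be combined into a single Excel-readable file; a sketch (the combined file name and the use of point forecasts only are assumptions):

# Combine the four May 2010 point forecasts into one CSV
# (hypothetical combined output file).
data.frame(Day  = 1:31,
           ATM1 = as.numeric(atmOneForecast$mean),
           ATM2 = as.numeric(atmTwoForecast$mean),
           ATM3 = as.numeric(atmThreeForecast$mean),
           ATM4 = as.numeric(atmFourForecast$mean)) %>%
  write.csv('./data/SHaslettPartAatmForecasts.csv', row.names = FALSE)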

 

Part B – Forecasting Power

Data File: ResidentialCustomerForecastLoad-624.xlsx

Objectives

Part B consists of a simple data set of residential power usage for January 1998 through December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours; the rest is straightforward. Add this to your existing files above.

 

Data Analysis And Transformation

1. Read in the data from the data file and take a look at its structure.

residentialPowerUsage <- import('./data/ResidentialCustomerForecastLoad-624.xlsx')
paged_table(residentialPowerUsage, options = list(rows.print = 30))
# Summarize the data.
summary(residentialPowerUsage)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

2. Data Transformation

Looking at the summary of the raw data above, we can see that the KWH column contains one missing value and one major low outlier of 770523 KWH. The date format also looks problematic for plotting and forecasting. Before dealing with the missing value and the outlier, we will reformat the dates so that we can plot the data and inspect it visually.

# Format the date column into a compatible format.
residentialPowerUsage$`YYYY-MMM` <- paste0(residentialPowerUsage$`YYYY-MMM`, '-01')
residentialPowerUsage$date <- ymd(residentialPowerUsage$`YYYY-MMM`)

# Plot the data.
ggplot(residentialPowerUsage, aes(date, KWH)) + geom_line() + 
  labs(title = 'Residential Power Usage', y = 'Power Consumption In kWh', x = 'Time')

The above plot confirms the outlier visually.

To take care of both the missing value and the outlier, I will impute both values with the mean of the data for the month in which they occur.

# The missing value occurs in September 2008, so impute it with
# the mean of all September observations.
residentialPowerUsage$Month <- month(residentialPowerUsage$date)
residentialPowerUsage$KWH[is.na(residentialPowerUsage$KWH)] <- mean(residentialPowerUsage$KWH[residentialPowerUsage$Month == 9], na.rm = TRUE)

# The outlier occurs in July 2010, so impute it with the mean of all July observations.
residentialPowerUsage$KWH[residentialPowerUsage$KWH == min(residentialPowerUsage$KWH)] <- mean(residentialPowerUsage$KWH[residentialPowerUsage$Month == 7], na.rm = TRUE)

Now that we have imputed the data to take care of both the outlier and the missing value, we will summarize and re-plot the data to confirm that the issues are indeed resolved.

# Summarize and plot the data to confirm that both the missing value and the outlier have been taken care of.
summary(residentialPowerUsage)
##   CaseSequence     YYYY-MMM              KWH                date           
##  Min.   :733.0   Length:192         Min.   : 4313019   Min.   :1998-01-01  
##  1st Qu.:780.8   Class :character   1st Qu.: 5443502   1st Qu.:2001-12-24  
##  Median :828.5   Mode  :character   Median : 6351262   Median :2005-12-16  
##  Mean   :828.5                      Mean   : 6543304   Mean   :2005-12-15  
##  3rd Qu.:876.2                      3rd Qu.: 7649733   3rd Qu.:2009-12-08  
##  Max.   :924.0                      Max.   :10655730   Max.   :2013-12-01  
##      Month      
##  Min.   : 1.00  
##  1st Qu.: 3.75  
##  Median : 6.50  
##  Mean   : 6.50  
##  3rd Qu.: 9.25  
##  Max.   :12.00
# Plot the data after imputation.
ggplot(residentialPowerUsage, aes(date, KWH)) + geom_line() + 
  labs(title = 'Residential Power Usage After Imputation', y = 'Power Consumption In kWh', x = 'Time')

The above summary and plot confirm that the imputation has resolved both the missing value and the outlier.

 

Timeseries and Modeling

Now that we have cleansed the data, we can convert the data into a time series and work on our ARIMA model in preparation for creating a 2014 monthly forecast.

1. Convert the data into a time series.

# Convert the data to a time series.
residentialPowerUsageTimeSeries <- ts(residentialPowerUsage[,'KWH'], start = c(1998, 1), frequency = 12)

2. Analyze the time series.

# Plot the time series data.
autoplot(residentialPowerUsageTimeSeries) +
  labs(title = 'Monthly Residential Power Usage', subtitle = 'January 1998 Through December 2013') +
  ylab('Power Consumption In kWh') +
  xlab('Month') 

The above plot reveals that the time series is seasonal, with peaks roughly every six months, probably because more electricity is consumed for heating in the winter and cooling in the summer. The series also displays an upward trend.
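A seasonal plot makes those winter and summer peaks explicit; a supplementary sketch:

# Supplementary view: overlay each year by month to expose the
# winter and summer consumption peaks.
ggseasonplot(residentialPowerUsageTimeSeries, year.labels = TRUE) +
  labs(title = 'Seasonal Plot: Monthly Residential Power Usage',
       y = 'Power Consumption In kWh')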

ARIMA Modeling

In order to select the most appropriate ARIMA model for forecasting, we will use the auto.arima() function.

# Run the series through the auto.arima() function and have it automatically select an appropriate ARIMA model. 
lambda <- BoxCox.lambda(residentialPowerUsageTimeSeries)
residentialPowerConsumptionAutoArimaSelection <- auto.arima(residentialPowerUsageTimeSeries, lambda = lambda)

# Print out the results of the selection. 
summary(residentialPowerConsumptionAutoArimaSelection)
## Series: residentialPowerUsageTimeSeries 
## ARIMA(0,0,1)(2,1,0)[12] with drift 
## Box Cox transformation: lambda= -0.2018638 
## 
## Coefficients:
##          ma1     sar1     sar2   drift
##       0.2444  -0.7265  -0.3599  0.0001
## s.e.  0.0828   0.0750   0.0766  0.0001
## 
## sigma^2 estimated as 0.00001647:  log likelihood=742.97
## AIC=-1475.95   AICc=-1475.6   BIC=-1459.98
## 
## Training set error measures:
##                 ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 48157 626456.3 477342.9 0.1542706 7.247545 0.7798858 0.09981818

The auto.arima() function has selected an ARIMA(0,0,1)(2,1,0)[12] with drift model, so we will use this to create our 2014 monthly power consumption forecast.

2014 Monthly Power Consumption Forecast
# Forecast monthly power consumption for 2014 and plot the results.
residentialPowerConsumptionForecast <- forecast(residentialPowerConsumptionAutoArimaSelection, 12, level = 95)
autoplot(residentialPowerConsumptionForecast) +
  labs(title = 'ARIMA(0,0,1)(2,1,0)[12] With Drift Forecast',
       subtitle = '2014 Monthly Power Consumption Forecast') +
  ylab('Power Consumption In kWh') +
  xlab('Month')

The forecast captures both the seasonal pattern and the gradual upward trend of the series.
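As with the ATM models, the residuals should be checked before the forecast is accepted; a minimal sketch:

# Residual diagnostics for the selected model.
checkresiduals(residentialPowerConsumptionAutoArimaSelection)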

Export The Forecast To A CSV File
residentialPowerConsumptionForecast %>% write.csv('./data/SHaslettPartBresidentialPowerConsumptionForecast.csv', row.names = FALSE)

Part C – BONUS

Data Files: Waterflow_Pipe1.xlsx, Waterflow_Pipe2.xlsx

Objectives

Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour. Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel readable file.

Data Analysis And Transformation

1. Read in the data from the data files and take a look at their structure.

# Import the 2 data files.
waterFlowPipeOne <- import('./data/Waterflow_Pipe1.xlsx', col_types = c('date', 'numeric'))
waterFlowPipeTwo <- import('./data/Waterflow_Pipe2.xlsx', col_types = c('date', 'numeric'))

Now that we have imported the data, let's look at the structure of both data sets.

Water Flow Pipe 1 Data Structure

# Look at the data structure for pipe 1.
paged_table(waterFlowPipeOne, options = list(rows.print = 15))

Water Flow Pipe 2 Data Structure

# Look at the data structure for pipe 2.
paged_table(waterFlowPipeTwo, options = list(rows.print = 15))

Looking at the structure of the two data sets above, we can see that both files measure water flow but record it at different time intervals: the Water Flow Pipe 1 data set records at irregular intervals, whereas the Water Flow Pipe 2 data set records every hour. Because Pipe 2 is already on a regular hourly grid, we can aggregate Pipe 1 to mean water flow per hour and then join the two data sets.

Aggregate The Two Data Sets

2. Prepare the waterFlowPipeOne data set for aggregation.

# Rename the columns (removing the space from 'Date Time') so that the two data sets align for joining.
colnames(waterFlowPipeOne) <- c('DateTime', 'WaterFlow')
colnames(waterFlowPipeTwo) <- c('DateTime', 'WaterFlow')

# Prep the waterFlowPipeOne data set for aggregation: label each reading
# with the hour in which it ends so that its timestamps conform to those
# of waterFlowPipeTwo.
waterFlowPipeOne <- waterFlowPipeOne %>% 
                      mutate(Date = as.Date(DateTime), Time = hour(DateTime) + 1) %>%
                      group_by(Date, Time) %>%
                      summarise(WaterFlow = mean(WaterFlow)) %>%
                      ungroup() %>%
                      mutate(DateTime = ymd_h(paste(Date, Time))) %>%
                      select(DateTime, WaterFlow)

# Re-examine the dataset after transformation to ensure that the timestamps match those of waterFlowPipeTwo. 
paged_table(waterFlowPipeOne, options = list(rows.print = 20))

After transforming the waterFlowPipeOne data set, we can see from the above table that the timestamp now matches that of waterFlowPipeTwo. We can now aggregate the 2 data sets.

3. Aggregate the two data sets.

# Combine the waterFlowPipeOne data set with waterFlowPipeTwo.
waterFlowPipes <- full_join(waterFlowPipeOne, waterFlowPipeTwo, by = 'DateTime', suffix = c('_PipeOne', '_PipeTwo')) %>%
  mutate(WaterFlow_PipeOne = ifelse(is.na(WaterFlow_PipeOne), 0, WaterFlow_PipeOne)) %>%
  mutate(WaterFlow_PipeTwo = ifelse(is.na(WaterFlow_PipeTwo), 0, WaterFlow_PipeTwo)) %>%
  mutate(WaterFlow = WaterFlow_PipeOne + WaterFlow_PipeTwo) %>%
  select(DateTime, WaterFlow)

# Re-examine the dataset after aggregation. 
paged_table(waterFlowPipes, options = list(rows.print = 30))

Convert The Aggregated Data To A Time Series And Explore The Data.

Now that the data has been aggregated, we will convert it to a time series and check for stationarity.

# Convert the aggregated data to a time series and plot the results.
waterFlowTimeSeries <- ts(waterFlowPipes$WaterFlow)
autoplot(waterFlowTimeSeries) +
  labs(title = 'Water Flow Time Series') +
  ylab('Waterflow Measurement') +
  xlab('Time') 

Looking at the shifting level and variance of the Water Flow time series above, it is unlikely that the data is stationary. To confirm this, we will perform a KPSS Unit Root Test.

# Perform a KPSS Unit Root Test on the time series to check for stationarity.
waterFlowTimeSeries %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 5.5327 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The results of the KPSS test confirm that the data is not stationary, as the test statistic (5.5327) far exceeds all of the critical values. To address this, we will difference the data once at lag 1 and run the KPSS test again to see whether it becomes stationary.

# Difference the data and retest.
waterFlowTimeSeries %>% diff() %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0092 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The above test results confirm that the data has become stationary after differencing. To confirm this visually, we will plot the differenced time series.

# Replot the time series after differencing to check for stationarity.
waterFlowTimeSeriesDifferenced <- waterFlowTimeSeries %>% diff()
autoplot(waterFlowTimeSeriesDifferenced) +
  labs(title = 'Stationary Water Flow Time Series After Differencing')

The above plot visually confirms that the data is stationary after differencing.

ARIMA Modeling and Forecasting

Now that we have confirmed that the data is stationary after a single difference, we will use the auto.arima() function to select a non-seasonal ARIMA model and provide a week-forward forecast of the water flow measurements.

# Use the auto.arima() function to select a non-seasonal ARIMA model for forecasting.
waterFlowAutoArima <- auto.arima(waterFlowTimeSeries, seasonal = FALSE)
waterFlowAutoArima
## Series: waterFlowTimeSeries 
## ARIMA(1,1,1) 
## 
## Coefficients:
##           ar1      ma1
##       -0.0512  -0.9493
## s.e.   0.0334   0.0115
## 
## sigma^2 estimated as 268.4:  log likelihood=-4211.09
## AIC=8428.18   AICc=8428.21   BIC=8442.9

The auto.arima() function has selected an ARIMA(1,1,1) model, so we will use this to forecast the water flow measurements.

# Forecast the data using the auto selected ARIMA model.
waterFlowForecast <- forecast(waterFlowAutoArima, h = 7 * 24)

# Check the residuals.
checkresiduals(waterFlowForecast)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 2.1151, df = 8, p-value = 0.9773
## 
## Model df: 2.   Total lags used: 10
# Plot the forecast.
autoplot(waterFlowForecast) +
  labs(title = 'ARIMA(1,1,1) Water Flow Forecast',
       subtitle = 'Week Forward Forecast Of Water Flow Measurements',
       y = 'Water Flow',
       x = 'Day')

Looking at the above results, the ARIMA(1,1,1) model seems to be a good fit. Most of the autocorrelations in the ACF plot fall within the confidence bounds, and the large Ljung-Box p-value (0.9773) indicates that the residuals are indistinguishable from white noise. The forecast plot is also satisfactory, so we will export the forecast to a CSV file.

Export The Forecast To A CSV File
waterFlowForecast %>% write.csv('./data/SHaslettPartCWaterFlow.csv', row.names = FALSE)
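For a more Excel-friendly layout, the forecast can also be written out with explicit hourly timestamps; a sketch (the output file name is hypothetical, and the timestamps are derived from the last observation in the aggregated data):

# Hypothetical alternative export: attach hourly timestamps to the
# week-forward point forecasts and prediction intervals.
waterFlowForecastFrame <- as.data.frame(waterFlowForecast)
waterFlowForecastFrame$DateTime <- max(waterFlowPipes$DateTime) + hours(1:(7 * 24))
waterFlowForecastFrame %>% write.csv('./data/SHaslettPartCWaterFlowHourly.csv', row.names = FALSE)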