Part A – ATM Forecast

In Part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. I am being somewhat ambiguous on purpose to give this more of a business feel. Explain and demonstrate your process, the techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report (findings, visuals, discussion, and R code) via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.

The first step is to read in the data and examine it to determine the next steps: we read in the Excel file, convert it to a data frame, and convert the date column into a readable format.
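A note on setup: the code below relies on a handful of packages that are assumed to be loaded throughout; a minimal setup chunk would look like this.

library(readxl)    # read_excel: import the .xlsx data
library(dplyr)     # the %>% pipe and filter()
library(readr)     # write_csv: export the forecasts
library(ggplot2)   # plotting backend for autoplot
library(forecast)  # hw(), ses(), auto.arima(), snaive(), stlf(), accuracy(), checkresiduals()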

setwd("/Users/elinaazrilyan/Downloads/")
atmdata <- readxl::read_excel("ATM624Data.xlsx", skip=0)
atmdf<-as.data.frame(atmdata)
atmdf$DATE <- as.Date(atmdf$DATE, origin = "1899-12-30")
summary(atmdf)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

Preparing the Data

We see that there are NAs. The 14 rows with a missing ATM label are the empty placeholder rows for May 2010, the period we are asked to forecast (note the pre-filter maximum date of 2010-05-14), so the next step is to drop them; the remaining missing Cash values will be imputed per ATM below.

atmdf <- atmdf %>% filter(!is.na(ATM))
summary(atmdf)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1460        Min.   :    0.0  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:    0.5  
##  Median :2009-10-30   Mode  :character   Median :   73.0  
##  Mean   :2009-10-30                      Mean   :  155.6  
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
##  Max.   :2010-04-30                      Max.   :10919.8  
##                                          NA's   :5

We can see that we have one year of data, ranging from 5/1/2009 to 4/30/2010. Since Cash is in hundreds of dollars, withdrawals range from $0 to $1,091,980; the maximum looks suspiciously high next to the mean ($15,560) and median ($7,300).
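Since Cash is recorded in hundreds of dollars, a quick sanity check converts those summary statistics to dollars:

c(max = 10919.8, mean = 155.6, median = 73.0) * 100
##     max  mean median
## 1091980 15560   7300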

We will now separate the data for the 4 ATMs and examine them individually. We will also replace NAs with the median value for each ATM.

atm1 <- subset(atmdf, ATM == "ATM1")
atm2 <- subset(atmdf, ATM == "ATM2")
atm3 <- subset(atmdf, ATM == "ATM3")
atm4 <- subset(atmdf, ATM == "ATM4")

atm1$Cash <- atm1$Cash %>% replace(is.na(.), median(atm1$Cash, na.rm=T))
summary(atm1)
##       DATE                ATM                 Cash       
##  Min.   :2009-05-01   Length:365         Min.   :  1.00  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 73.00  
##  Median :2009-10-30   Mode  :character   Median : 91.00  
##  Mean   :2009-10-30                      Mean   : 83.95  
##  3rd Qu.:2010-01-29                      3rd Qu.:108.00  
##  Max.   :2010-04-30                      Max.   :180.00
atm2$Cash <- atm2$Cash %>% replace(is.na(.), median(atm2$Cash, na.rm=T))
summary(atm2)
##       DATE                ATM                 Cash      
##  Min.   :2009-05-01   Length:365         Min.   :  0.0  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 26.0  
##  Median :2009-10-30   Mode  :character   Median : 67.0  
##  Mean   :2009-10-30                      Mean   : 62.6  
##  3rd Qu.:2010-01-29                      3rd Qu.: 93.0  
##  Max.   :2010-04-30                      Max.   :147.0
atm3$Cash <- atm3$Cash %>% replace(is.na(.), median(atm3$Cash, na.rm=T))
summary(atm3)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:365         Min.   : 0.0000  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 0.0000  
##  Median :2009-10-30   Mode  :character   Median : 0.0000  
##  Mean   :2009-10-30                      Mean   : 0.7206  
##  3rd Qu.:2010-01-29                      3rd Qu.: 0.0000  
##  Max.   :2010-04-30                      Max.   :96.0000
atm4$Cash <- atm4$Cash %>% replace(is.na(.), median(atm4$Cash, na.rm=T))
summary(atm4)
##       DATE                ATM                 Cash          
##  Min.   :2009-05-01   Length:365         Min.   :    1.563  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:  124.334  
##  Median :2009-10-30   Mode  :character   Median :  403.839  
##  Mean   :2009-10-30                      Mean   :  474.043  
##  3rd Qu.:2010-01-29                      3rd Qu.:  704.507  
##  Max.   :2010-04-30                      Max.   :10919.762

The next step is to convert the data into time series format and see if any additional adjustments and clean up will be needed.

atm1ts <- ts(atm1["Cash"], frequency = 7)
atm2ts <- ts(atm2["Cash"], frequency = 7)
atm3ts <- ts(atm3["Cash"], frequency = 7)
atm4ts <- ts(atm4["Cash"], frequency = 7)

autoplot(atm1ts) 

autoplot(atm2ts) 

autoplot(atm3ts) 

autoplot(atm4ts) 

I am concerned about a few things in the plots above: there is almost no data for ATM3, and there is a significant outlier in the ATM4 data that is obvious from visual inspection. I will now remove the ATM4 outlier and replace it with the next-highest observed value.

tail(sort(atm4$Cash))
## [1]  1301.396  1484.127  1495.155  1574.779  1712.075 10919.762
atm4$Cash[atm4$Cash >=2000] <- 1712.075
tail(sort(atm4$Cash))
## [1] 1301.396 1484.127 1495.155 1574.779 1712.075 1712.075
atm4ts <- ts(atm4["Cash"], frequency = 7)
autoplot(atm4ts) 

The data looks much more realistic now, and it is worth noting that the scale of ATM4's daily withdrawals is very different from that of ATM1 and ATM2.

I have decided to fill the missing ATM3 history with the day-by-day mean of ATM1 and ATM2; I believe this will work better than forecasting from only the 3 recent data points available.

# days 1-362 have no real ATM3 data; days 363-365 keep the actual observations
atm3$Cash[1:362] <- (atm1$Cash[1:362] + atm2$Cash[1:362]) / 2

atm3ts <- ts(atm3["Cash"], frequency = 7)
autoplot(atm3ts)

Analysis

ATM1

Let’s determine whether there is weekly seasonality in our data.

ggseasonplot(atm1ts, polar=TRUE) +
   ylab("ATM Withdrawals") +
   ggtitle("Polar seasonal plot: ATM 1")

atm1ts %>% decompose(type="additive") %>%
  autoplot() + xlab("Week") +
  ggtitle("Classical additive decomposition of ATM 1")

It is obvious from the plot above that there is seasonality and no trend. I am a little concerned about the pattern in the remainder plot: there is a lot of variation as we get into more recent data.

I will now use Holt-Winters’ additive method on the seasonally adjusted data, with an automatic lambda value for the Box-Cox transformation.

atm1fit1 <- decompose(atm1ts, type="additive")
atm1fit1 <- atm1fit1 %>% seasadj() %>% hw(h=14, damped=TRUE, lambda = "auto") 
autoplot(forecast(atm1fit1))

accuracy(atm1fit1)
##                     ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.2515983 23.98107 15.41124 11.92344 64.67484 0.8703618
##                   ACF1
## Training set 0.1332457

Now let’s check residuals.

checkresiduals(atm1fit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt-Winters' additive method
## Q* = 24.449, df = 3, p-value = 2.013e-05
## 
## Model df: 12.   Total lags used: 15

The Ljung-Box p-value is very small, so we reject the hypothesis of independently distributed residuals: these residuals are distinguishable from white noise, and some autocorrelation remains.
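For reference, checkresiduals is running a Ljung-Box test under the hood; the equivalent direct call, using the lag count and model degrees of freedom reported above, is:

# small p-value => reject independence, i.e. the residuals are not white noise
Box.test(residuals(atm1fit1), lag = 15, type = "Ljung-Box", fitdf = 12)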

I am going to try an ARIMA model as well and see what results it yields.

atm1fit2 <- auto.arima(atm1ts)
atm1fit2
## Series: atm1ts 
## ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.1695  -0.5867  -0.0989
## s.e.  0.0553   0.0508   0.0497
## 
## sigma^2 estimated as 559.8:  log likelihood=-1641.12
## AIC=3290.23   AICc=3290.34   BIC=3305.75
autoplot(forecast(atm1fit2))

accuracy(atm1fit2)
##                       ME    RMSE      MAE       MPE     MAPE      MASE
## Training set -0.07363941 23.3332 14.60281 -102.7359 117.6027 0.8247052
##                      ACF1
## Training set -0.008134059

Now let’s check the residuals for the ARIMA model.

checkresiduals(atm1fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,2)[7]
## Q* = 16.789, df = 11, p-value = 0.1143
## 
## Model df: 3.   Total lags used: 14

The ACF and the residual distribution look fine, and the Ljung-Box p-value of 0.11 means we cannot reject independence: these residuals are consistent with white noise.
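Before deciding between the two models, it helps to see the training-set error measures side by side (a convenience sketch; the row labels are my own):

atm1compare <- rbind(accuracy(atm1fit1), accuracy(atm1fit2))
rownames(atm1compare) <- c("Holt-Winters", "ARIMA")
atm1compare[, c("RMSE", "MAE", "MASE")]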

The ARIMA model has a slightly lower RMSE and residuals that are closer to white noise, but the RMSE difference is not that significant, so I am going to proceed with the Holt-Winters’ forecast.

atm1fcdf <- data.frame(forecast(atm1fit1, h=14))
write_csv(atm1fcdf, "ATM1_Result.csv")

ATM2

Let’s determine whether there is weekly seasonality in our data.

ggseasonplot(atm2ts, polar=TRUE) +
   ylab("ATM Withdrawals") +
   ggtitle("Polar seasonal plot: ATM 2")

atm2ts %>% decompose(type="additive") %>%
  autoplot() + xlab("Week") +
  ggtitle("Classical additive decomposition of ATM 2")

It is obvious from the plot above that there is seasonality and a slight downward trend. The remainder plot shows a lot of variation as we get into more recent data.

I will now use Holt-Winters’ additive method on the seasonally adjusted data, with an automatic lambda value for the Box-Cox transformation.

atm2fit1 <- decompose(atm2ts, type="additive")
atm2fit1 <- atm2fit1 %>% seasadj() %>% hw(h=14, damped=TRUE, lambda = "auto") 
autoplot(forecast(atm2fit1))

accuracy(atm2fit1)
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -4.438064 25.35359 18.42074 -2.799921 94.77185 0.8851845
##                    ACF1
## Training set 0.02623264

Now let’s check residuals.

checkresiduals(atm2fit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt-Winters' additive method
## Q* = 25.901, df = 3, p-value = 1e-05
## 
## Model df: 12.   Total lags used: 15

The Ljung-Box p-value is again very small, so these residuals are distinguishable from white noise; significant autocorrelation remains.

I am going to try an ARIMA model as well and see what results it yields.

atm2fit2 <- auto.arima(atm2ts, lambda = "auto")
atm2fit2
## Series: atm2ts 
## ARIMA(3,0,3)(0,1,1)[7] with drift 
## Box Cox transformation: lambda= 0.7278107 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2      ma3     sma1    drift
##       0.4890  -0.4982  0.8330  -0.4843  0.3277  -0.7863  -0.7141  -0.0204
## s.e.  0.0855   0.0742  0.0606   0.1057  0.0946   0.0615   0.0456   0.0074
## 
## sigma^2 estimated as 69.29:  log likelihood=-1265.19
## AIC=2548.39   AICc=2548.9   BIC=2583.31
autoplot(forecast(atm2fit2))

accuracy(atm2fit2)
##                   ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 1.44981 24.25163 17.12271 -Inf  Inf 0.8228094 0.005867841

Now let’s check the residuals for the ARIMA model.

checkresiduals(atm2fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,3)(0,1,1)[7] with drift
## Q* = 8.859, df = 6, p-value = 0.1817
## 
## Model df: 8.   Total lags used: 14

The ACF and the residual distribution look fine, and the Ljung-Box p-value of 0.18 means we cannot reject independence: these residuals are consistent with white noise.

Since the ARIMA model has both a lower RMSE and residuals consistent with white noise, I am going to use that model.

atm2fcdf <- data.frame(forecast(atm2fit2, h=14))
write_csv(atm2fcdf, "ATM2_Result.csv")

ATM3

Let’s determine whether there is weekly seasonality in our data.

ggseasonplot(atm3ts, polar=TRUE) +
   ylab("ATM Withdrawals") +
   ggtitle("Polar seasonal plot: ATM 3")

atm3ts %>% decompose(type="additive") %>%
  autoplot() + xlab("Week") +
  ggtitle("Classical additive decomposition of ATM 3")

It is obvious from the plot above that there is seasonality and no trend. The remainder plot shows a lot of variation as we get into more recent data.

I want to use Holt-Winters’ additive method and give more weight to recent data, since the last few observations are the only actual information we have from ATM3; I will therefore use a high smoothing parameter, alpha = 0.75.
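To see what alpha = 0.75 implies, recall that in exponential smoothing the weight placed on an observation j periods old is alpha * (1 - alpha)^j, so the weights die off very quickly (an illustrative calculation):

alpha <- 0.75
round(alpha * (1 - alpha)^(0:5), 4)
## [1] 0.7500 0.1875 0.0469 0.0117 0.0029 0.0007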

I will now use Holt-Winters’ additive method on the seasonally adjusted data, with an automatic lambda value for the Box-Cox transformation.

atm3fit1 <- decompose(atm3ts, type="additive")
atm3fit1 <- atm3fit1 %>% seasadj() %>% hw(h=14, alpha = 0.75, lambda = "auto") 
autoplot(forecast(atm3fit1))

accuracy(atm3fit1)
##                      ME     RMSE     MAE      MPE     MAPE     MASE
## Training set -0.6183751 31.87542 20.7277 22.12999 94.21412 1.309426
##                    ACF1
## Training set -0.1383788
checkresiduals(atm3fit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 166.19, df = 3, p-value < 2.2e-16
## 
## Model df: 11.   Total lags used: 14

The ACF suggests the residuals are correlated, and the extremely small Ljung-Box p-value (< 2.2e-16) confirms it: these residuals are clearly distinguishable from white noise, which is not surprising given that most of the ATM3 series is imputed.

I am going to try an ARIMA model as well and see what results it yields.

atm3fit2 <- auto.arima(atm3ts, lambda = "auto")
atm3fit2
## Series: atm3ts 
## ARIMA(2,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.7606979 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4636  -0.8721  0.5521  0.7807  -0.7124
## s.e.   0.0807   0.0528  0.1077  0.0720   0.0433
## 
## sigma^2 estimated as 62.98:  log likelihood=-1249.08
## AIC=2510.15   AICc=2510.39   BIC=2533.44
autoplot(forecast(atm3fit2))

accuracy(atm3fit2)
##                   ME     RMSE      MAE       MPE   MAPE      MASE
## Training set 0.17537 20.89342 13.80117 -56.17381 72.769 0.8718577
##                      ACF1
## Training set -0.008116625

Now let’s check the residuals for the ARIMA model.

checkresiduals(atm3fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.7159, df = 9, p-value = 0.374
## 
## Model df: 5.   Total lags used: 14

The ACF and the residual distribution look fine, and the Ljung-Box p-value of 0.37 means we cannot reject independence: these residuals are consistent with white noise.

Since Holt-Winters’ method allows us to give more weight to the recent, genuine ATM3 observations, I am going to use that model, even though the ARIMA forecast has a much lower RMSE and cleaner residuals.

atm3fcdf <- data.frame(forecast(atm3fit1, h=14))
write_csv(atm3fcdf, "ATM3_Result.csv")

ATM4

Let’s determine whether there is weekly seasonality in our data.

ggseasonplot(atm4ts, polar=TRUE) +
   ylab("ATM Withdrawals") +
   ggtitle("Polar seasonal plot: ATM 4")

atm4ts %>% decompose(type="additive") %>%
  autoplot() + xlab("Week") +
  ggtitle("Classical additive decomposition of ATM 4")

There is no clear seasonality or trend in ATM4 data.

I will now use the Simple Exponential Smoothing method with an automatic lambda value for the Box-Cox transformation.

atm4fit1 <- ses(atm4ts, lambda = "auto", h=14)  # SES has no trend component, so a damped argument has no effect
autoplot(forecast(atm4fit1))

accuracy(atm4fit1)
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 97.41179 369.7028 295.4643 -431.2928 479.2372 0.8430885
##                    ACF1
## Training set 0.06045465

Now let’s check residuals.

checkresiduals(atm4fit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 54.867, df = 12, p-value = 1.912e-07
## 
## Model df: 2.   Total lags used: 14

The Ljung-Box p-value is very small, so these residuals are distinguishable from white noise; autocorrelation remains.

I am going to try an ARIMA model as well and see what results it yields.

atm4fit2 <- auto.arima(atm4ts, lambda = "auto")
atm4fit2
## Series: atm4ts 
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.447289 
## 
## Coefficients:
##         sar1    sar2     mean
##       0.2157  0.1788  28.4276
## s.e.  0.0518  0.0522   1.1206
## 
## sigma^2 estimated as 175.5:  log likelihood=-1459.98
## AIC=2927.95   AICc=2928.06   BIC=2943.55
autoplot(forecast(atm4fit2))

accuracy(atm4fit2)
##                    ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 87.94535 359.4022 279.3686 -380.113 425.0579 0.7971604
##                    ACF1
## Training set 0.07698303

Now let’s check the residuals for the ARIMA model.

checkresiduals(atm4fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Q* = 14.337, df = 11, p-value = 0.2149
## 
## Model df: 3.   Total lags used: 14

The ACF and the residual distribution look fine, and the Ljung-Box p-value of 0.21 means we cannot reject independence: these residuals are consistent with white noise.

Since the ARIMA model has a lower RMSE and residuals consistent with white noise, I am going to use that model.

atm4fcdf <- data.frame(forecast(atm4fit2, h=14))
write_csv(atm4fcdf, "ATM4_Result.csv")
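Since the assignment asks for an Excel-readable deliverable, the four point forecasts can also be combined into a single table (a sketch; the file and column names are my own, and it keeps the two-week horizon used above):

atm_forecasts <- data.frame(
  DATE = seq(as.Date("2010-05-01"), by = "day", length.out = 14),
  ATM1 = as.numeric(forecast(atm1fit1, h = 14)$mean),
  ATM2 = as.numeric(forecast(atm2fit2, h = 14)$mean),
  ATM3 = as.numeric(forecast(atm3fit1, h = 14)$mean),
  ATM4 = as.numeric(forecast(atm4fit2, h = 14)$mean)
)
write_csv(atm_forecasts, "ATM_Forecasts_May2010.csv")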

Part B – Forecasting Power

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Reading in the data from Excel:

setwd("/Users/elinaazrilyan/Downloads/")
powerdata <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx", skip=0)
powerdf<-as.data.frame(powerdata)
summary(powerdf)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

After reviewing the summary, we will proceed with data clean-up.

Data Clean-Up

I will replace the single NA value with the median of the data.

powerdf$KWH <- powerdf$KWH %>% replace(is.na(.), median(powerdf$KWH, na.rm=T))
summary(powerdf)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5434539  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6501333  
##  3rd Qu.:876.2                      3rd Qu.: 7608792  
##  Max.   :924.0                      Max.   :10655730

I will convert the data to a time series and review the plot.

powerdfts <- ts(powerdf["KWH"], start = c(1998, 1), frequency = 12)
autoplot(powerdfts)

I will replace the visible outlier (the minimum value) with the 75th percentile.

powerdf$KWH <- powerdf$KWH %>% replace(powerdf$KWH==min(powerdf$KWH), quantile(powerdf$KWH, 0.75))
summary(powerdf)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   : 4313019  
##  1st Qu.:780.8   Class :character   1st Qu.: 5443502  
##  Median :828.5   Mode  :character   Median : 6314472  
##  Mean   :828.5                      Mean   : 6536949  
##  3rd Qu.:876.2                      3rd Qu.: 7617591  
##  Max.   :924.0                      Max.   :10655730
powerdfts <- ts(powerdf["KWH"], start = c(1998, 1), frequency = 12)
autoplot(powerdfts)

ggAcf(powerdfts)

There is a clear indication of seasonality in both of the plots above, and there may be a slight upward trend.

autoplot(decompose(powerdfts))

Decomposition confirms seasonality and upward trend.

The polar plot makes it easy to see the seasonal pattern.

ggseasonplot(powerdfts, polar=TRUE) +
   ggtitle("Polar seasonal plot: POWER")

It is clear that power consumption increases during the summer and winter months, when air conditioning and heating consume additional electricity, and is much lower during the spring and fall. Here is another illustration of that:

ggsubseriesplot(powerdfts) + 
  ylab("KWH") +
  ggtitle("Seasonal subseries plot: Power")

Analysis

Test and train data

I will split the data into training and test sets, using the last 2 years of data for testing.

powerdfts.train <- window(powerdfts, end=c(2011,12))
powerdfts.test <- window(powerdfts, start=2012)
autoplot(powerdfts) +
  autolayer(powerdfts.train, series="Training") +
  autolayer(powerdfts.test, series="Test")

I am going to try a variety of methods and see which results in the best forecast.

Seasonal Naive

fc1 <- snaive(powerdfts.train, h=24)
accuracy(fc1, powerdfts.test)
##                     ME     RMSE      MAE        MPE     MAPE     MASE
## Training set  114098.3 805602.6 624581.7  0.9213942 9.544575 1.000000
## Test set     -272199.3 965567.1 742063.5 -3.8050476 9.295010 1.188097
##                    ACF1 Theil's U
## Training set  0.3472363        NA
## Test set     -0.1987993 0.5349666
checkresiduals(fc1)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 82.562, df = 24, p-value = 2.365e-08
## 
## Model df: 0.   Total lags used: 24

The small Ljung-Box p-value indicates these residuals are distinguishable from white noise; the ACF confirms substantial remaining autocorrelation.

Holt-Winters’

For my next forecast I will try to use Holt-Winters’ method.

fc2 <- powerdfts.train %>% hw(h=12, lambda = "auto")
autoplot(forecast(fc2))

accuracy(fc2, powerdfts.test)
##                      ME     RMSE      MAE         MPE      MAPE     MASE
## Training set   22451.44 579996.7 446863.3  -0.3552597  6.961553 0.715460
## Test set     -828557.80 983466.9 828955.6 -11.9153503 11.920368 1.327217
##                    ACF1 Theil's U
## Training set 0.20847896        NA
## Test set     0.08904977 0.6768765
checkresiduals(fc2)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 45.879, df = 8, p-value = 2.506e-07
## 
## Model df: 16.   Total lags used: 24

The Ljung-Box p-value is again very small, so these residuals are distinguishable from white noise; autocorrelation remains.

I am going to try an ARIMA model next, using auto.arima.

fc3 <- auto.arima(powerdfts.train, lambda = "auto")
fc3
## Series: powerdfts.train 
## ARIMA(1,0,1)(2,1,0)[12] with drift 
## Box Cox transformation: lambda= -0.2079193 
## 
## Coefficients:
##          ar1      ma1     sar1     sar2  drift
##       0.8342  -0.6300  -0.7816  -0.4217  1e-04
## s.e.  0.1602   0.2144   0.0846   0.0814  1e-04
## 
## sigma^2 estimated as 1.405e-05:  log likelihood=658.19
## AIC=-1304.38   AICc=-1303.82   BIC=-1286.08
autoplot(forecast(fc3))

accuracy(forecast(fc3), powerdfts.test)
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set   60279.67 604409.0 475986.4  0.3481779 7.429188 0.7620882
## Test set     -105222.67 784633.6 631177.1 -2.0058979 8.212349 1.0105597
##                    ACF1 Theil's U
## Training set 0.16860725        NA
## Test set     0.05313159  0.452865

auto.arima selected ARIMA(1,0,1)(2,1,0)[12] with drift, and its test-set accuracy looks very promising so far.

checkresiduals(fc3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(2,1,0)[12] with drift
## Q* = 31.121, df = 19, p-value = 0.03916
## 
## Model df: 5.   Total lags used: 24

The Ljung-Box p-value of 0.039 is borderline: at the 5% level there is still some residual autocorrelation, though much less than for the previous models.

The last method I will try is STL decomposition combined with ETS:

fc4 <- stlf(powerdfts.train, method='ets', lambda = "auto")
autoplot(forecast(fc4))

accuracy(forecast(fc4), powerdfts.test)
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set   75998.45 520886.4 395888.9  0.4939691 6.115148 0.6338464
## Test set     -402705.94 848414.3 705178.0 -6.1638482 9.454028 1.1290405
##                    ACF1 Theil's U
## Training set 0.15959805        NA
## Test set     0.08677054 0.4642975
checkresiduals(fc4)
## Warning in checkresiduals(fc4): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 41.309, df = 22, p-value = 0.00757
## 
## Model df: 2.   Total lags used: 24

The Ljung-Box p-value of 0.008 again indicates some remaining autocorrelation in the residuals.

finalpower <- data.frame(rbind(
  accuracy(forecast(fc1), powerdfts.test),
  accuracy(forecast(fc2), powerdfts.test),
  accuracy(forecast(fc3), powerdfts.test),
  accuracy(forecast(fc4), powerdfts.test)
))
rownames(finalpower) <- c("Seasonal Naive Train", "Seasonal Naive Test", "Holt-Winter's Train", "Holt-Winter's Test", "ARIMA Train", "ARIMA Test", "STL-ETS Train", "STL-ETS Test")
finalpower
##                              ME     RMSE      MAE         MPE      MAPE
## Seasonal Naive Train  114098.31 805602.6 624581.7   0.9213942  9.544575
## Seasonal Naive Test  -272199.29 965567.1 742063.5  -3.8050476  9.295010
## Holt-Winter's Train    22451.44 579996.7 446863.3  -0.3552597  6.961553
## Holt-Winter's Test   -828557.80 983466.9 828955.6 -11.9153503 11.920368
## ARIMA Train            60279.67 604409.0 475986.4   0.3481779  7.429188
## ARIMA Test           -105222.67 784633.6 631177.1  -2.0058979  8.212349
## STL-ETS Train          75998.45 520886.4 395888.9   0.4939691  6.115148
## STL-ETS Test         -402705.94 848414.3 705178.0  -6.1638482  9.454028
##                           MASE        ACF1 Theil.s.U
## Seasonal Naive Train 1.0000000  0.34723632        NA
## Seasonal Naive Test  1.1880967 -0.19879933 0.5349666
## Holt-Winter's Train  0.7154600  0.20847896        NA
## Holt-Winter's Test   1.3272171  0.08904977 0.6768765
## ARIMA Train          0.7620882  0.16860725        NA
## ARIMA Test           1.0105597  0.05313159 0.4528650
## STL-ETS Train        0.6338464  0.15959805        NA
## STL-ETS Test         1.1290405  0.08677054 0.4642975

The lowest test-set RMSE is with ARIMA(1,0,1)(2,1,0)[12] with drift, so that is the approach I will use for my forecast, refitting auto.arima on the full series.

fc_total <- auto.arima(powerdfts, lambda = "auto")
autoplot(forecast(fc_total))

powerfcdf <- data.frame(forecast(fc_total, h=12))
write_csv(powerfcdf, "Power_Result.csv")
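One caveat: write_csv drops row names, and the month labels of the forecast live in the row names of powerfcdf. A sketch that adds an explicit month column before writing (the column name is my own):

powerfcdf <- data.frame(forecast(fc_total, h = 12))
powerfcdf$Month <- format(seq(as.Date("2014-01-01"), by = "month", length.out = 12), "%Y-%b")
write_csv(powerfcdf, "Power_Result.csv")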