Project 1 This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday Oct 25. I will accept late submissions with a penalty until the meetup after that when we review some projects.
ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
# read the data
atm <- read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric")) %>%
mutate(DATE = as_date(DATE)) %>% #setup the date
arrange(DATE, ATM) #arrange by data and ATM
head(atm)## # A tibble: 6 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-01 ATM3 0
## 4 2009-05-01 ATM4 777.
## 5 2009-05-02 ATM1 82
## 6 2009-05-02 ATM2 89
tail(atm)## # A tibble: 6 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-05-09 <NA> NA
## 2 2010-05-10 <NA> NA
## 3 2010-05-11 <NA> NA
## 4 2010-05-12 <NA> NA
## 5 2010-05-13 <NA> NA
## 6 2010-05-14 <NA> NA
summary(atm)## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
From the above summary I can see the data is between May 2009 and April 2010 for four ATMs, the mean of the cash is 155.6, it looks there are outliers because the 10919.8 max which is way higher than the mean. and it looks there are a 0.0 transaction of min.
From the tail I can see there are some data with NA values. I will drop them below.
atm <- atm %>% drop_na()
tail(atm)## # A tibble: 6 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-29 ATM3 82
## 2 2010-04-29 ATM4 44.2
## 3 2010-04-30 ATM1 85
## 4 2010-04-30 ATM2 90
## 5 2010-04-30 ATM3 85
## 6 2010-04-30 ATM4 482.
Not lets visualize the cash per ATM throughout the whole period
atm %>% ggplot(aes(x =DATE, y = Cash, col=ATM)) +
geom_line()+
facet_wrap(~ATM, ncol=1, scales = "free_y")From the above I can see the cash transactions vary per ATM: ATM1 and ATM2 have relatively similar transaction type, they look they get effected by each day of the week and they are more predictive.
While ATM3 is more close to 0 value cash transaction except an outlier at the end of April 2010. For ATM4 is also predictive looks close to ATM1 and ATM3 but has outlier needed to be removed to preform the analysis.
Now lets vitalize the distributions of the ATM’s and see type of distribution
atm %>% ggplot(aes(x =Cash, fill = ATM)) +
geom_histogram(position="identity")+
facet_wrap(~ATM, scales = 'free') ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ATM1 is close to normal distribution, ATM2 is a relativity normal distribution where it has two peaks, while ATM3 is a close to zero distribution with outliers, and ATM4 a right skewed distribution with outliers.
Now to checkt the outliers I need to run a boxplot.
atm %>% ggplot(aes(y =Cash)) +
geom_boxplot() +
facet_wrap(~ATM, scales = 'free', nrow = 1) From the boxplots we can ATM1 has outliers, while ATM2 doesn’t have outliers. For ATM3 its zeros except for three outliers, and ATM4 has few outlier and one is very extreme.
I will remove the data related to ATM3 since they are zeros and spread the data throughout the ATM’s.
clean_atm <- atm %>% spread(ATM, Cash)
clean_atm <- select(clean_atm, -c(ATM3))
clean_atm$ATM1 <- tsclean(clean_atm$ATM1, replace.missing = TRUE)
clean_atm$ATM2 <- tsclean(clean_atm$ATM2, replace.missing = TRUE)
clean_atm$ATM4 <- tsclean(clean_atm$ATM4, replace.missing = TRUE)
summary(clean_atm)## DATE ATM1 ATM2 ATM4
## Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 1.563
## 1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 26.00 1st Qu.: 124.334
## Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 402.770
## Mean :2009-10-30 Mean : 84.15 Mean : 62.59 Mean : 444.757
## 3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 704.192
## Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :1712.075
In the above I removed I cleand the data and replaced the NA data and the extreme outlier for ATM4 got removed.
Now I explore the models per ATM and I will start with ATM1
#covert to tsibble object
clean_atm %>% as_tsibble(index = DATE) -> atm_ts
# decompose series
atm_ts %>% model(classical_decomposition(ATM1, type = "add")) %>%
components() %>%
autoplot()+
labs(title = "Classical decomposition for ATM1")## Warning: Removed 3 row(s) containing missing values (geom_path).
I don’t see any clear trend for the ATM1, for the random component it shows as unstable until teh last couple of months. for the ATM1 it looks more of a seasonality on a weekly basis.
Now I will check the corresponding time series, ACF, and PACF.
atm_ts %>% gg_tsdisplay(ATM1, plot_type = "partial")+
labs(title = "ATM1 untransformed time serise, ACF, and PACF.")From the series I can se no predictable patterns, and I can see in the ACF there is a seasonality on a 7, 14, 21 days, there is an auto correlation every 7 days.
ndiffs(atm_ts$ATM1)## [1] 0
It doesn’t show that ATM1 needs a box cox or non seasonal differncing I’ll use an automatic ARIMA model as compared to an ETS model, an ARIMA (x,0,x)(x,1,x) model where p,q, P, and Q are to be determined.
atm_ts$ATM1 %>% ets()## ETS(M,N,N)
##
## Call:
## ets(y = .)
##
## Smoothing parameters:
## alpha = 0.0326
##
## Initial states:
## l = 79.7942
##
## sigma: 0.4352
##
## AIC AICc BIC
## 4785.562 4785.628 4797.262
atm_ts %>% model(ARIMA(ATM1)) -> fit1
report(fit1)## Series: ATM1
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1950 -0.5806 -0.1037
## s.e. 0.0546 0.0506 0.0494
##
## sigma^2 estimated as 556.4: log likelihood=-1640.01
## AIC=3288.03 AICc=3288.14 BIC=3303.55
From the above I can see that auto ARIMA showed an ARIMA (0,0,1)(0,1,2) model with an AIC of 3288 which was nearly 1.5x less than ETS model. and with a lower AIC I chose ARIMA model to predict ATM1.
# decompose series
atm_ts %>% model(classical_decomposition(ATM2, type = "add")) %>%
components() %>%
autoplot()+
labs(title = "Classical decomposition for ATM2")## Warning: Removed 3 row(s) containing missing values (geom_path).
I see slow downward trend the ATM2, for the random component it shows as unstable until the last couple of months. for the ATM2 it looks more of a seasonality on a weekly basis.
Now I will check the corresponding time series, ACF, and PACF.
atm_ts %>% gg_tsdisplay(ATM2, plot_type = "partial")+
labs(title = "ATM2 untransformed time serise, ACF, and PACF.")From the series I can se no predictable patterns, and I can see in the ACF there is a seasonality on a 7, 14, 21 days, there is an auto correlation every 7 days. I also see a spike at 2, and 5 days for both ACF and PACF.
We explore whether non-seasonal differencing is:
ndiffs(atm_ts$ATM2)## [1] 1
It doesn’t show that ATM2 needs a box cox seasonal differencing (D=1) certainly is, I’ll use an automatic ARIMA model as compared to an ETS model, an ARIMA (x,0,x)(x,1,x) model where p,q, P, and Q are to be determined.
atm_ts$ATM2 %>% ets()## ETS(A,A,N)
##
## Call:
## ets(y = .)
##
## Smoothing parameters:
## alpha = 7e-04
## beta = 7e-04
##
## Initial states:
## l = 78.8479
## b = -0.0342
##
## sigma: 38.286
##
## AIC AICc BIC
## 4820.352 4820.519 4839.851
atm_ts %>% model(ARIMA(ATM2)) -> fit2
report(fit2)## Series: ATM2
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4320 -0.9130 0.4773 0.8048 -0.7547
## s.e. 0.0553 0.0407 0.0861 0.0556 0.0381
##
## sigma^2 estimated as 602.5: log likelihood=-1653.67
## AIC=3319.33 AICc=3319.57 BIC=3342.61
From the above I can see that auto ARIMA showed an ARIMA (2,0,2)(0,1,1) model with an AIC of 3319 which was nearly 1.5x less than ETS model. and with a lower AIC I chose ARIMA model to predict ATM1.
ATM3 has only 3 data points to work with, I elected to take the mean of these 3 values to provide a forecast:
#select ATM3 column
clean_atm <- atm %>% spread(ATM, Cash)
clean_atm3 <- select(clean_atm, c(ATM3))
#select only non-zero rows
simple_atm3 <- tail(clean_atm3, n=3)
#take the mean
atm3_mean <- mean(simple_atm3$ATM3)
atm3_mean## [1] 87.66667
#forecast
atm3_forecast <- ts(rep(atm3_mean), start = 1, end = 31)The average value for ATM3 was 87.66667 and so we forecast this value for the entire month of May.
# decompose series
atm_ts %>% model(classical_decomposition(ATM4, type = "add")) %>%
components() %>%
autoplot()+
labs(title = "Classical decomposition for ATM4")## Warning: Removed 3 row(s) containing missing values (geom_path).
I don’t see and trend the ATM4, for the random component it shows as unstable. for the ATM2 it looks more of a seasonality on a weekly basis.
Now I will check the corresponding time series, ACF, and PACF.
atm_ts %>% gg_tsdisplay(ATM4, plot_type = "partial")+
labs(title = "ATM4 untransformed time serise, ACF, and PACF.")From the series I can se no predictable patterns, and I can see in the ACF there is a seasonality on a 7, 14, 21 days, there is an auto correlation every 7 days. I also see a spike at 2, and 5 days for both ACF and PACF.
We explore whether non-seasonal differencing is:
ndiffs(atm_ts$ATM4)## [1] 0
It doesn’t show that ATM4 needs a box cox seasonal differencing (D=1) certainly is, I’ll use an automatic ARIMA model as compared to an ETS model, an ARIMA (x,0,x)(x,1,x) model where p,q, P, and Q are to be determined.
atm_ts$ATM4 %>% ets()## ETS(A,N,N)
##
## Call:
## ets(y = .)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 444.8009
##
## sigma: 351.5848
##
## AIC AICc BIC
## 6437.046 6437.113 6448.746
atm_ts %>% model(ARIMA(ATM4)) -> fit4
report(fit4)## Series: ATM4
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 constant
## -0.8778 -0.6769 0.1700 0.9634 0.8065 0.2484 795.0206
## s.e. 0.1004 0.1186 0.0564 0.0906 0.1079 0.0556 48.8937
##
## sigma^2 estimated as 117528: log likelihood=-2645.21
## AIC=5306.42 AICc=5306.82 BIC=5337.62
From the above I can see that auto ARIMA showed an ARIMA (3,0,2)(1,0,0) model with an AIC of 5306 which was less than ETS model. and with a lower AIC I chose ARIMA model to predict ATM1.
I will check the forecast for May of 2010, output all forecasts in one plot to verify the output, and then send forecasts to csv as requested of the submitter:
Here I will forecast only all ATM’s except ATM3.
#forecast
fit1 %>% forecast(h=31) -> atm1_forecast
fit2 %>% forecast(h=31) -> atm2_forecast
fit4 %>% forecast(h=31) -> atm4_forecast
#output all forecast plots in one
gridExtra::grid.arrange(
autoplot(atm1_forecast) +
labs(title = "ATM1: ARIMA(0,0,1)(0,1,2)", x = NULL, y = NULL) +
theme(legend.position = "none"),
autoplot(atm2_forecast) +
labs(title = "ATM2: ARIMA(2,0,2)(0,1,1)", x = NULL, y = "100s of Dollars") +
theme(legend.position = "none"),
autoplot(atm4_forecast) +
labs(title = "ATM4: ARIMA(3,0,2)(1,0,0)", x = "Day", y = NULL) +
theme(legend.position = "none"),
top = grid::textGrob("Forecasted ATM withdrawals for May of 2010\n")
) From the above forecasts, it appears that ATM1 and ATM2 forecasts may be accurate but ATM4 does not appear to follow the flow of the data up until May of 2010 and so my confidence in this forecast is minimal. Maybe a model outside the scope of this class would have performed better here …
ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
# read the data
rcf_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx") %>%
rename(month = `YYYY-MMM`)
head(rcf_data)## # A tibble: 6 × 3
## CaseSequence month KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
tail(rcf_data)## # A tibble: 6 × 3
## CaseSequence month KWH
## <dbl> <chr> <dbl>
## 1 919 2013-Jul 8415321
## 2 920 2013-Aug 9080226
## 3 921 2013-Sep 7968220
## 4 922 2013-Oct 5759367
## 5 923 2013-Nov 5769083
## 6 924 2013-Dec 9606304
summary(rcf_data)## CaseSequence month KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
It shows in the data the min KWH is 770523 and the mean is 6502475 while the max is 10655730 and there are one NA’s.
# Convert to time series
rcf <- ts(rcf_data$KWH, start = c(1998,1), frequency = 12)
summary(rcf)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
I will check the NA’s in the data using the library imputTS
library("imputeTS")
ggplot_na_distribution(rcf)Running the boxplot to see if there are any outlier.
boxplot(rcf_data$KWH, ylab = "KWH")The out of boxplot
out <- boxplot.stats(rcf_data$KWH)$out
out## [1] 770523
From the boxplot above I can see outlier in the data. I will zero the value and apply the same kNN 5
rcf_data <- rcf_data %>%
mutate(KWH = na_if(KWH, out))
rcf_data$KWH <- na_ma(rcf_data$KWH, k=5, weighting = "simple")
summary(rcf_data)## CaseSequence month KWH
## Min. :733.0 Length:192 Min. : 4313019
## 1st Qu.:780.8 Class :character 1st Qu.: 5443502
## Median :828.5 Mode :character Median : 6339797
## Mean :828.5 Mean : 6531803
## 3rd Qu.:876.2 3rd Qu.: 7608792
## Max. :924.0 Max. :10655730
Now we don’t have any missing data, and removing the outliers.
I will transform the data
rcf2 <- rcf_data
#ts as of 12 freq
tsrcf <- ts(rcf2$KWH, frequency = 12, start=c(1998,1))
ggseasonplot((tsrcf), labelgap = 1)ggsubseriesplot(tsrcf)Looking at the seasonal plot, we can see that the power usage increases in mid-summer and mid-winter across all years.
ggtsdisplay(tsrcf) Here we see the ACF and PACF plots we can see some seasonality within the data.
BOX-COX transform
rcf_lamda <- BoxCox.lambda(tsrcf)
rcf_rbc <- BoxCox(tsrcf, lambda = rcf_lamda)
ggseasonplot(rcf_rbc)ggsubseriesplot(rcf_rbc)ggtsdisplay(rcf_rbc)ndiffs(tsrcf)## [1] 1
ndiffs(rcf_rbc)## [1] 1
There not much major change in the transformation using BOX-COX.
No Box-Cox with a single Diff
ggtsdisplay(diff(tsrcf, 12), points = FALSE)Box-Cox with a single Diff
ggtsdisplay(diff(rcf_rbc, 12), points = FALSE)Now I will be using the ARIMA model on the data
rcf_model <- Arima(tsrcf, order = c(1,1,2), seasonal=c(1,1,2))
summary(rcf_model)## Series: tsrcf
## ARIMA(1,1,2)(1,1,2)[12]
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1 sma2
## -0.4594 -0.2116 -0.6683 0.1912 -1.0568 0.3482
## s.e. 0.2201 0.1826 0.1594 0.3456 0.3326 0.2376
##
## sigma^2 = 4.089e+11: log likelihood = -2650.19
## AIC=5314.39 AICc=5315.04 BIC=5336.7
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 42101.78 607001.7 449961.1 0.04052387 6.793162 0.7087288
## ACF1
## Training set 0.02217025
From the above I can see the model showed an ARIMA (1,1,2)(1,1,2) model with an AIC of 5314
Checking the residuals for the ARIMA model.
checkresiduals(rcf_model)##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)(1,1,2)[12]
## Q* = 14.027, df = 18, p-value = 0.7273
##
## Model df: 6. Total lags used: 24
Running an Auto ARIMA for the trasform data.
auto_model <- auto.arima(tsrcf, d=1, approximation = FALSE)
summary(auto_model)## Series: tsrcf
## ARIMA(1,1,2)(2,1,0)[12]
##
## Coefficients:
## ar1 ma1 ma2 sar1 sar2
## -0.5466 -0.1434 -0.7342 -0.8017 -0.4634
## s.e. 0.1837 0.1480 0.1244 0.0753 0.0761
##
## sigma^2 = 4.037e+11: log likelihood = -2649.66
## AIC=5311.32 AICc=5311.81 BIC=5330.44
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 36857.45 604849.7 445553.1 -0.009880913 6.736162 0.7017858
## ACF1
## Training set 0.02896049
From the above I can see the model showed an auto ARIMA (1,1,2)(2,1,0) model with an AIC of 5311 which is little less than the ARIMA model
Now lets plot the residuals for the auto model with ACF and DFSY residuals.
checkresiduals(auto_model)##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)(2,1,0)[12]
## Q* = 14.373, df = 19, p-value = 0.7615
##
## Model df: 5. Total lags used: 24
Plot the forecast for the automodel of ARIMA.
rcf_forecast <- forecast(auto_model, h = 12)
autoplot(rcf_forecast)Getting the mean foreasts of the data for 2014
signif(rcf_forecast$mean/1000000,3)## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2014 10.10 8.60 7.07 6.18 6.27 8.49 9.90 10.40 8.82 6.16 6.47 8.51
Predicting the data using the auto ARIMA model, I can see from the table of predicting the data for 2014 in all months, and it also shows the seasonality of the data where its goes up in mid-summer and mid-winder, its also shows samll trend across all years, and 2014 is one of the highest years of consuming power.
Lastly saving the file of KWH forecasts as CSV.
export <- rcf_forecast$mean
write.csv(export, "KWH Forecasts.csv")