Project1_624 Predictive Analytics

Project 1 This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday Oct 25. I will accept late submissions with a penalty until the meetup after that when we review some projects.

Part A – ATM Forecast

ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

# read the data
atm <- read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric")) %>% 
  mutate(DATE = as_date(DATE)) %>% #setup the date
  arrange(DATE, ATM) #arrange by data and ATM
head(atm)

## # A tibble: 6 × 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM1    96 
## 2 2009-05-01 ATM2   107 
## 3 2009-05-01 ATM3     0 
## 4 2009-05-01 ATM4   777.
## 5 2009-05-02 ATM1    82 
## 6 2009-05-02 ATM2    89

tail(atm)

## # A tibble: 6 × 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-05-09 <NA>     NA
## 2 2010-05-10 <NA>     NA
## 3 2010-05-11 <NA>     NA
## 4 2010-05-12 <NA>     NA
## 5 2010-05-13 <NA>     NA
## 6 2010-05-14 <NA>     NA

summary(atm)

##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

From the above summary I can see the data is between May 2009 and April 2010 for four ATMs, the mean of the cash is 155.6, it looks there are outliers because the 10919.8 max which is way higher than the mean. and it looks there are a 0.0 transaction of min.

From the tail I can see there are some data with NA values. I will drop them below.

atm <- atm %>% drop_na()

tail(atm)

## # A tibble: 6 × 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-29 ATM3   82  
## 2 2010-04-29 ATM4   44.2
## 3 2010-04-30 ATM1   85  
## 4 2010-04-30 ATM2   90  
## 5 2010-04-30 ATM3   85  
## 6 2010-04-30 ATM4  482.

Not lets visualize the cash per ATM throughout the whole period

atm %>% ggplot(aes(x =DATE, y = Cash, col=ATM)) +
  geom_line()+
  facet_wrap(~ATM, ncol=1, scales = "free_y")

From the above I can see the cash transactions vary per ATM: ATM1 and ATM2 have relatively similar transaction type, they look they get effected by each day of the week and they are more predictive.

While ATM3 is more close to 0 value cash transaction except an outlier at the end of April 2010. For ATM4 is also predictive looks close to ATM1 and ATM3 but has outlier needed to be removed to preform the analysis.

Now lets vitalize the distributions of the ATM’s and see type of distribution

atm %>% ggplot(aes(x =Cash, fill = ATM)) +
  geom_histogram(position="identity")+
  facet_wrap(~ATM, scales = 'free')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ATM1 is close to normal distribution, ATM2 is a relativity normal distribution where it has two peaks, while ATM3 is a close to zero distribution with outliers, and ATM4 a right skewed distribution with outliers.

Now to checkt the outliers I need to run a boxplot.

atm %>% ggplot(aes(y =Cash)) +
  geom_boxplot() +
  facet_wrap(~ATM, scales = 'free', nrow = 1)

From the boxplots we can ATM1 has outliers, while ATM2 doesn’t have outliers. For ATM3 its zeros except for three outliers, and ATM4 has few outlier and one is very extreme.

I will remove the data related to ATM3 since they are zeros and spread the data throughout the ATM’s.

clean_atm <- atm %>% spread(ATM, Cash)

clean_atm <- select(clean_atm, -c(ATM3))

clean_atm$ATM1 <- tsclean(clean_atm$ATM1, replace.missing = TRUE)
clean_atm$ATM2 <- tsclean(clean_atm$ATM2, replace.missing = TRUE)
clean_atm$ATM4 <- tsclean(clean_atm$ATM4, replace.missing = TRUE)

summary(clean_atm)

##       DATE                 ATM1             ATM2             ATM4         
##  Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   :   1.563  
##  1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 26.00   1st Qu.: 124.334  
##  Median :2009-10-30   Median : 91.00   Median : 67.00   Median : 402.770  
##  Mean   :2009-10-30   Mean   : 84.15   Mean   : 62.59   Mean   : 444.757  
##  3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 704.192  
##  Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :1712.075

In the above I removed I cleand the data and replaced the NA data and the extreme outlier for ATM4 got removed.

ATM1

Now I explore the models per ATM and I will start with ATM1

#covert to tsibble object
clean_atm %>% as_tsibble(index = DATE) -> atm_ts

# decompose series
atm_ts %>% model(classical_decomposition(ATM1, type = "add")) %>% 
  components() %>% 
  autoplot()+
  labs(title = "Classical decomposition for ATM1")

## Warning: Removed 3 row(s) containing missing values (geom_path).

I don’t see any clear trend for the ATM1, for the random component it shows as unstable until teh last couple of months. for the ATM1 it looks more of a seasonality on a weekly basis.

Now I will check the corresponding time series, ACF, and PACF.

atm_ts %>% gg_tsdisplay(ATM1, plot_type = "partial")+
  labs(title = "ATM1 untransformed time serise, ACF, and PACF.")

From the series I can se no predictable patterns, and I can see in the ACF there is a seasonality on a 7, 14, 21 days, there is an auto correlation every 7 days.

ndiffs(atm_ts$ATM1)

## [1] 0

It doesn’t show that ATM1 needs a box cox or non seasonal differncing I’ll use an automatic ARIMA model as compared to an ETS model, an ARIMA (x,0,x)(x,1,x) model where p,q, P, and Q are to be determined.

atm_ts$ATM1 %>% ets()

## ETS(M,N,N) 
## 
## Call:
##  ets(y = .) 
## 
##   Smoothing parameters:
##     alpha = 0.0326 
## 
##   Initial states:
##     l = 79.7942 
## 
##   sigma:  0.4352
## 
##      AIC     AICc      BIC 
## 4785.562 4785.628 4797.262

atm_ts %>% model(ARIMA(ATM1)) -> fit1

report(fit1)

## Series: ATM1 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.1950  -0.5806  -0.1037
## s.e.  0.0546   0.0506   0.0494
## 
## sigma^2 estimated as 556.4:  log likelihood=-1640.01
## AIC=3288.03   AICc=3288.14   BIC=3303.55

From the above I can see that auto ARIMA showed an ARIMA (0,0,1)(0,1,2) model with an AIC of 3288 which was nearly 1.5x less than ETS model. and with a lower AIC I chose ARIMA model to predict ATM1.

ATM2

# decompose series
atm_ts %>% model(classical_decomposition(ATM2, type = "add")) %>% 
  components() %>% 
  autoplot()+
  labs(title = "Classical decomposition for ATM2")

## Warning: Removed 3 row(s) containing missing values (geom_path).

I see slow downward trend the ATM2, for the random component it shows as unstable until the last couple of months. for the ATM2 it looks more of a seasonality on a weekly basis.

Now I will check the corresponding time series, ACF, and PACF.

atm_ts %>% gg_tsdisplay(ATM2, plot_type = "partial")+
  labs(title = "ATM2 untransformed time serise, ACF, and PACF.")

From the series I can se no predictable patterns, and I can see in the ACF there is a seasonality on a 7, 14, 21 days, there is an auto correlation every 7 days. I also see a spike at 2, and 5 days for both ACF and PACF.

We explore whether non-seasonal differencing is:

ndiffs(atm_ts$ATM2)

## [1] 1

It doesn’t show that ATM2 needs a box cox seasonal differencing (D=1) certainly is, I’ll use an automatic ARIMA model as compared to an ETS model, an ARIMA (x,0,x)(x,1,x) model where p,q, P, and Q are to be determined.

atm_ts$ATM2 %>% ets()

## ETS(A,A,N) 
## 
## Call:
##  ets(y = .) 
## 
##   Smoothing parameters:
##     alpha = 7e-04 
##     beta  = 7e-04 
## 
##   Initial states:
##     l = 78.8479 
##     b = -0.0342 
## 
##   sigma:  38.286
## 
##      AIC     AICc      BIC 
## 4820.352 4820.519 4839.851

atm_ts %>% model(ARIMA(ATM2)) -> fit2

report(fit2)

## Series: ATM2 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4320  -0.9130  0.4773  0.8048  -0.7547
## s.e.   0.0553   0.0407  0.0861  0.0556   0.0381
## 
## sigma^2 estimated as 602.5:  log likelihood=-1653.67
## AIC=3319.33   AICc=3319.57   BIC=3342.61

From the above I can see that auto ARIMA showed an ARIMA (2,0,2)(0,1,1) model with an AIC of 3319 which was nearly 1.5x less than ETS model. and with a lower AIC I chose ARIMA model to predict ATM1.

ATM3

ATM3 has only 3 data points to work with, I elected to take the mean of these 3 values to provide a forecast:

#select ATM3 column
clean_atm <- atm %>% spread(ATM, Cash)

clean_atm3 <- select(clean_atm, c(ATM3))



#select only non-zero rows
simple_atm3 <- tail(clean_atm3, n=3)

#take the mean
atm3_mean <- mean(simple_atm3$ATM3)
atm3_mean

## [1] 87.66667

#forecast
atm3_forecast <- ts(rep(atm3_mean), start = 1, end = 31)

The average value for ATM3 was 87.66667 and so we forecast this value for the entire month of May.

ATM4

# decompose series
atm_ts %>% model(classical_decomposition(ATM4, type = "add")) %>% 
  components() %>% 
  autoplot()+
  labs(title = "Classical decomposition for ATM4")

## Warning: Removed 3 row(s) containing missing values (geom_path).

I don’t see and trend the ATM4, for the random component it shows as unstable. for the ATM2 it looks more of a seasonality on a weekly basis.

Now I will check the corresponding time series, ACF, and PACF.

atm_ts %>% gg_tsdisplay(ATM4, plot_type = "partial")+
  labs(title = "ATM4 untransformed time serise, ACF, and PACF.")

We explore whether non-seasonal differencing is:

ndiffs(atm_ts$ATM4)

## [1] 0

It doesn’t show that ATM4 needs a box cox seasonal differencing (D=1) certainly is, I’ll use an automatic ARIMA model as compared to an ETS model, an ARIMA (x,0,x)(x,1,x) model where p,q, P, and Q are to be determined.

atm_ts$ATM4 %>% ets()

## ETS(A,N,N) 
## 
## Call:
##  ets(y = .) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 444.8009 
## 
##   sigma:  351.5848
## 
##      AIC     AICc      BIC 
## 6437.046 6437.113 6448.746

atm_ts %>% model(ARIMA(ATM4)) -> fit4

report(fit4)

## Series: ATM4 
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean 
## 
## Coefficients:
##           ar1      ar2     ar3     ma1     ma2    sar1  constant
##       -0.8778  -0.6769  0.1700  0.9634  0.8065  0.2484  795.0206
## s.e.   0.1004   0.1186  0.0564  0.0906  0.1079  0.0556   48.8937
## 
## sigma^2 estimated as 117528:  log likelihood=-2645.21
## AIC=5306.42   AICc=5306.82   BIC=5337.62

From the above I can see that auto ARIMA showed an ARIMA (3,0,2)(1,0,0) model with an AIC of 5306 which was less than ETS model. and with a lower AIC I chose ARIMA model to predict ATM1.

Forecast

I will check the forecast for May of 2010, output all forecasts in one plot to verify the output, and then send forecasts to csv as requested of the submitter:

Here I will forecast only all ATM’s except ATM3.

#forecast
fit1 %>% forecast(h=31) -> atm1_forecast
fit2 %>% forecast(h=31) -> atm2_forecast
fit4 %>% forecast(h=31) -> atm4_forecast

#output all forecast plots in one
gridExtra::grid.arrange(
  autoplot(atm1_forecast) + 
    labs(title = "ATM1: ARIMA(0,0,1)(0,1,2)", x = NULL, y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm2_forecast) + 
    labs(title = "ATM2: ARIMA(2,0,2)(0,1,1)", x = NULL, y = "100s of Dollars") +
    theme(legend.position = "none"),
  autoplot(atm4_forecast) + 
    labs(title = "ATM4: ARIMA(3,0,2)(1,0,0)", x = "Day", y = NULL) +
    theme(legend.position = "none"),
  top = grid::textGrob("Forecasted ATM withdrawals for May of 2010\n")
)

From the above forecasts, it appears that ATM1 and ATM2 forecasts may be accurate but ATM4 does not appear to follow the flow of the data up until May of 2010 and so my confidence in this forecast is minimal. Maybe a model outside the scope of this class would have performed better here …

Part B – Forecasting Power

ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

# read the data
rcf_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx") %>% 
  rename(month = `YYYY-MMM`)
head(rcf_data)

## # A tibble: 6 × 3
##   CaseSequence month        KWH
##          <dbl> <chr>      <dbl>
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147

tail(rcf_data)

## # A tibble: 6 × 3
##   CaseSequence month        KWH
##          <dbl> <chr>      <dbl>
## 1          919 2013-Jul 8415321
## 2          920 2013-Aug 9080226
## 3          921 2013-Sep 7968220
## 4          922 2013-Oct 5759367
## 5          923 2013-Nov 5769083
## 6          924 2013-Dec 9606304

summary(rcf_data)

##   CaseSequence      month                KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

It shows in the data the min KWH is 770523 and the mean is 6502475 while the max is 10655730 and there are one NA’s.

# Convert to time series

rcf <- ts(rcf_data$KWH, start = c(1998,1), frequency = 12)
summary(rcf)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   770523  5429912  6283324  6502475  7620524 10655730        1

I will check the NA’s in the data using the library imputTS

library("imputeTS")
ggplot_na_distribution(rcf)

Running the boxplot to see if there are any outlier.

boxplot(rcf_data$KWH, ylab = "KWH")

The out of boxplot

out <- boxplot.stats(rcf_data$KWH)$out
out

## [1] 770523

From the boxplot above I can see outlier in the data. I will zero the value and apply the same kNN 5

rcf_data <- rcf_data %>% 
  mutate(KWH = na_if(KWH, out))

rcf_data$KWH <- na_ma(rcf_data$KWH, k=5, weighting = "simple")

summary(rcf_data)

##   CaseSequence      month                KWH          
##  Min.   :733.0   Length:192         Min.   : 4313019  
##  1st Qu.:780.8   Class :character   1st Qu.: 5443502  
##  Median :828.5   Mode  :character   Median : 6339797  
##  Mean   :828.5                      Mean   : 6531803  
##  3rd Qu.:876.2                      3rd Qu.: 7608792  
##  Max.   :924.0                      Max.   :10655730

Now we don’t have any missing data, and removing the outliers.

I will transform the data

rcf2 <- rcf_data

#ts as of 12 freq
tsrcf <- ts(rcf2$KWH, frequency = 12, start=c(1998,1))

ggseasonplot((tsrcf), labelgap = 1)

ggsubseriesplot(tsrcf)

Looking at the seasonal plot, we can see that the power usage increases in mid-summer and mid-winter across all years.

ggtsdisplay(tsrcf)

Here we see the ACF and PACF plots we can see some seasonality within the data.

BOX-COX transform

rcf_lamda <- BoxCox.lambda(tsrcf)

rcf_rbc <- BoxCox(tsrcf, lambda = rcf_lamda)

ggseasonplot(rcf_rbc)

ggsubseriesplot(rcf_rbc)

ggtsdisplay(rcf_rbc)

ndiffs(tsrcf)

## [1] 1

ndiffs(rcf_rbc)

## [1] 1

There not much major change in the transformation using BOX-COX.

No Box-Cox with a single Diff

ggtsdisplay(diff(tsrcf, 12), points = FALSE)

Box-Cox with a single Diff

ggtsdisplay(diff(rcf_rbc, 12), points = FALSE)

Now I will be using the ARIMA model on the data

rcf_model <- Arima(tsrcf, order = c(1,1,2), seasonal=c(1,1,2))
summary(rcf_model)

## Series: tsrcf 
## ARIMA(1,1,2)(1,1,2)[12] 
## 
## Coefficients:
##           ar1      ma1      ma2    sar1     sma1    sma2
##       -0.4594  -0.2116  -0.6683  0.1912  -1.0568  0.3482
## s.e.   0.2201   0.1826   0.1594  0.3456   0.3326  0.2376
## 
## sigma^2 = 4.089e+11:  log likelihood = -2650.19
## AIC=5314.39   AICc=5315.04   BIC=5336.7
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 42101.78 607001.7 449961.1 0.04052387 6.793162 0.7087288
##                    ACF1
## Training set 0.02217025

From the above I can see the model showed an ARIMA (1,1,2)(1,1,2) model with an AIC of 5314

Checking the residuals for the ARIMA model.

checkresiduals(rcf_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)(1,1,2)[12]
## Q* = 14.027, df = 18, p-value = 0.7273
## 
## Model df: 6.   Total lags used: 24

Running an Auto ARIMA for the trasform data.

auto_model <- auto.arima(tsrcf, d=1, approximation = FALSE)
summary(auto_model)

## Series: tsrcf 
## ARIMA(1,1,2)(2,1,0)[12] 
## 
## Coefficients:
##           ar1      ma1      ma2     sar1     sar2
##       -0.5466  -0.1434  -0.7342  -0.8017  -0.4634
## s.e.   0.1837   0.1480   0.1244   0.0753   0.0761
## 
## sigma^2 = 4.037e+11:  log likelihood = -2649.66
## AIC=5311.32   AICc=5311.81   BIC=5330.44
## 
## Training set error measures:
##                    ME     RMSE      MAE          MPE     MAPE      MASE
## Training set 36857.45 604849.7 445553.1 -0.009880913 6.736162 0.7017858
##                    ACF1
## Training set 0.02896049

From the above I can see the model showed an auto ARIMA (1,1,2)(2,1,0) model with an AIC of 5311 which is little less than the ARIMA model

Now lets plot the residuals for the auto model with ACF and DFSY residuals.

checkresiduals(auto_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)(2,1,0)[12]
## Q* = 14.373, df = 19, p-value = 0.7615
## 
## Model df: 5.   Total lags used: 24

Plot the forecast for the automodel of ARIMA.

rcf_forecast <- forecast(auto_model, h = 12)
autoplot(rcf_forecast)

Getting the mean foreasts of the data for 2014

signif(rcf_forecast$mean/1000000,3)

##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 2014 10.10  8.60  7.07  6.18  6.27  8.49  9.90 10.40  8.82  6.16  6.47  8.51

Predicting the data using the auto ARIMA model, I can see from the table of predicting the data for 2014 in all months, and it also shows the seasonality of the data where its goes up in mid-summer and mid-winder, its also shows samll trend across all years, and 2014 is one of the highest years of consuming power.

Lastly saving the file of KWH forecasts as CSV.

export <- rcf_forecast$mean
write.csv(export, "KWH Forecasts.csv")

Project1_624 Predictive Analytics

Karim Hammoud

Part A – ATM Forecast

ATM1

ATM2

ATM3

ATM4

Forecast

Part B – Forecasting Power