Project 1

This project consists of three parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, Oct 25. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.

Part A – ATM Forecast, ATM624Data.xlsx

In Part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more of a business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link, along with the actual .rmd file. Also, please submit the forecast, which you will put in an Excel-readable file.

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two datasets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note that for multiple recordings within an hour, you should take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
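
A minimal sketch of the hourly aggregation described above might look like the following; the column names ‘Date Time’ and ‘WaterFlow’ are assumptions about the workbooks, not confirmed here.

#a sketch of the Part C hourly aggregation, assuming each workbook has a
#timestamp column named "Date Time" and a value column named "WaterFlow"
#(hypothetical names -- adjust to the actual files)
library(readxl)
library(dplyr)
library(lubridate)

pipe1 <- read_xlsx("Waterflow_Pipe1.xlsx")
pipe2 <- read_xlsx("Waterflow_Pipe2.xlsx")

water_hourly <- bind_rows(pipe1, pipe2) %>%
  rename(datetime = `Date Time`, flow = WaterFlow) %>%
  #truncate each timestamp to the hour, then average multiple readings per hour
  mutate(hour = floor_date(as_datetime(datetime), unit = "hour")) %>%
  group_by(hour) %>%
  summarise(flow = mean(flow, na.rm = TRUE), .groups = "drop")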


library(readxl)
library(fpp3)
library(forecast)
library(ggplot2)
library(imputeTS)
library(readr)
library(summarytools)
library(tsibbledata)
library(lubridate)
library(DataExplorer)
library(ggfortify)
library(stats)
library(latex2exp)
library(aTSA)
library(zoo)

Part A – ATM Forecast

Load Data

#read in data and specify column types
atm_raw <- read_xlsx("ATM624Data.xlsx", col_types = c("date", "text", "numeric")) 

dfSummary(atm_raw, 
          plain.ascii  = FALSE, 
          style        = "grid", 
          graph.magnif = 0.75, 
          valid.col    = FALSE,
          tmp.img.dir  = "/tmp")
## ### Data Frame Summary  
## #### atm_raw  
## **Dimensions:** 1474 x 3  
## **Duplicates:** 0  
## 
## +----+-------------------+----------------------------+---------------------+----------------------+---------+
## | No | Variable          | Stats / Values             | Freqs (% of Valid)  | Graph                | Missing |
## +====+===================+============================+=====================+======================+=========+
## | 1  | DATE\             | min : 2009-05-01\          | 379 distinct values | ![](/tmp/ds0031.png) | 0\      |
## |    | [POSIXct, POSIXt] | med : 2009-11-01\          |                     |                      | (0.0%)  |
## |    |                   | max : 2010-05-14\          |                     |                      |         |
## |    |                   | range : 1y 0m 13d          |                     |                      |         |
## +----+-------------------+----------------------------+---------------------+----------------------+---------+
## | 2  | ATM\              | 1\. ATM1\                  | 365 (25.0%)\        | ![](/tmp/ds0032.png) | 14\     |
## |    | [character]       | 2\. ATM2\                  | 365 (25.0%)\        |                      | (0.9%)  |
## |    |                   | 3\. ATM3\                  | 365 (25.0%)\        |                      |         |
## |    |                   | 4\. ATM4                   | 365 (25.0%)         |                      |         |
## +----+-------------------+----------------------------+---------------------+----------------------+---------+
## | 3  | Cash\             | Mean (sd) : 155.6 (376.5)\ | 509 distinct values | ![](/tmp/ds0033.png) | 19\     |
## |    | [numeric]         | min < med < max:\          |                     |                      | (1.3%)  |
## |    |                   | 0 < 73 < 10919.8\          |                     |                      |         |
## |    |                   | IQR (CV) : 113.5 (2.4)     |                     |                      |         |
## +----+-------------------+----------------------------+---------------------+----------------------+---------+

Exploratory Data Analysis

head(atm_raw)
## # A tibble: 6 × 3
##   DATE                ATM    Cash
##   <dttm>              <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1     96
## 2 2009-05-01 00:00:00 ATM2    107
## 3 2009-05-02 00:00:00 ATM1     82
## 4 2009-05-02 00:00:00 ATM2     89
## 5 2009-05-03 00:00:00 ATM1     85
## 6 2009-05-03 00:00:00 ATM2     90

ATM1

#filter raw data by ATM
atm_raw %>% 
  filter(ATM == "ATM1")%>% 
  head(10) 
## # A tibble: 10 × 3
##    DATE                ATM    Cash
##    <dttm>              <chr> <dbl>
##  1 2009-05-01 00:00:00 ATM1     96
##  2 2009-05-02 00:00:00 ATM1     82
##  3 2009-05-03 00:00:00 ATM1     85
##  4 2009-05-04 00:00:00 ATM1     90
##  5 2009-05-05 00:00:00 ATM1     99
##  6 2009-05-06 00:00:00 ATM1     88
##  7 2009-05-07 00:00:00 ATM1      8
##  8 2009-05-08 00:00:00 ATM1    104
##  9 2009-05-09 00:00:00 ATM1     87
## 10 2009-05-10 00:00:00 ATM1     93

Let’s look at the correlations

#convert the POSIXct DATE column to a Date class
atm_raw$DATE = ymd(as.Date(atm_raw$DATE, origin = "1899-12-30"))

#build a tsibble keyed by ATM and check the correlations between variables
atm_ts = as_tsibble(tibble(atm_raw),index='DATE',key='ATM')
plot_correlation(atm_raw)

ATM2

#filter raw data by ATM
atm_raw %>% 
  filter(ATM == "ATM2")%>% 
  head(10) 
## # A tibble: 10 × 3
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-05-01 ATM2    107
##  2 2009-05-02 ATM2     89
##  3 2009-05-03 ATM2     90
##  4 2009-05-04 ATM2     55
##  5 2009-05-05 ATM2     79
##  6 2009-05-06 ATM2     19
##  7 2009-05-07 ATM2      2
##  8 2009-05-08 ATM2    103
##  9 2009-05-09 ATM2    107
## 10 2009-05-10 ATM2    118

ATM3

#filter raw data by ATM
atm_raw %>% 
  filter(ATM == "ATM3")%>% 
  head(10) 
## # A tibble: 10 × 3
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-05-01 ATM3      0
##  2 2009-05-02 ATM3      0
##  3 2009-05-03 ATM3      0
##  4 2009-05-04 ATM3      0
##  5 2009-05-05 ATM3      0
##  6 2009-05-06 ATM3      0
##  7 2009-05-07 ATM3      0
##  8 2009-05-08 ATM3      0
##  9 2009-05-09 ATM3      0
## 10 2009-05-10 ATM3      0

ATM4

#filter raw data by ATM
atm_raw %>% 
  filter(ATM == "ATM4")%>% 
  head(10) 
## # A tibble: 10 × 3
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-05-01 ATM4  777. 
##  2 2009-05-02 ATM4  524. 
##  3 2009-05-03 ATM4  793. 
##  4 2009-05-04 ATM4  908. 
##  5 2009-05-05 ATM4   52.8
##  6 2009-05-06 ATM4   52.2
##  7 2009-05-07 ATM4   55.5
##  8 2009-05-08 ATM4  559. 
##  9 2009-05-09 ATM4  904. 
## 10 2009-05-10 ATM4  879.

ATM4 contains values with decimals.

Let’s visualize the cash withdrawals

atm_raw %>% 
  filter(DATE < "2010-05-01", !is.na(ATM)) %>% 
  ggplot(aes(x = Cash)) +
    geom_histogram(bins = 30) +
    facet_wrap(~ ATM, ncol = 2, scales = "free") +
    labs(title = "ATM Withdrawals") +
    scale_y_continuous("Withdrawals (hundreds)")

atm_raw %>% 
  filter(DATE < "2010-05-01", !is.na(ATM)) %>% 
  ggplot(aes(x = DATE, y = Cash, col = ATM)) +
    geom_line(show.legend = FALSE) +
    facet_wrap(~ ATM, ncol = 1, scales = "free_y") +
    labs(title = "ATM Withdrawals", x = "Date") +
    scale_y_continuous("Withdrawals (hundreds)")

atm <- atm_raw %>% 
  #lowercase column names
  rename(date=DATE, atm=ATM, cash=Cash) %>%
  #convert POSIXct to Date type
  mutate(date = as.Date(date)) %>% 
  #convert from long to wide by expanding the atm column to get individual column for each atm
  pivot_wider(names_from=atm, values_from = cash) %>% 
  #remove the NA column
  select(-"NA") %>% 
  #filter out what we will be forecasting
  filter(date < "2010-05-01") %>% 
  as_tsibble(index=date) 
head(atm)
## # A tsibble: 6 x 5 [1D]
##   date        ATM1  ATM2  ATM3  ATM4
##   <date>     <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01    96   107     0 777. 
## 2 2009-05-02    82    89     0 524. 
## 3 2009-05-03    85    90     0 793. 
## 4 2009-05-04    90    55     0 908. 
## 5 2009-05-05    99    79     0  52.8
## 6 2009-05-06    88    19     0  52.2

NA Values

apply(is.na(atm),2,sum)
## date ATM1 ATM2 ATM3 ATM4 
##    0    3    2    0    0
atm[!complete.cases(atm), ]
## # A tsibble: 5 x 5 [1D]
##   date        ATM1  ATM2  ATM3  ATM4
##   <date>     <dbl> <dbl> <dbl> <dbl>
## 1 2009-06-13    NA    91     0 746. 
## 2 2009-06-16    NA    82     0 373. 
## 3 2009-06-18    21    NA     0  92.5
## 4 2009-06-22    NA    90     0  80.6
## 5 2009-06-24    66    NA     0  90.6

There are only 5 NA values: 3 from ATM1 and 2 from ATM2.

plot_boxplot(atm_ts,by="ATM")

Summary Statistics

summary(atm)
##       date                 ATM1             ATM2             ATM3        
##  Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000  
##  1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: 0.0000  
##  Median :2009-10-30   Median : 91.00   Median : 67.00   Median : 0.0000  
##  Mean   :2009-10-30   Mean   : 83.89   Mean   : 62.58   Mean   : 0.7206  
##  3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000  
##  Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :96.0000  
##                       NA's   :3        NA's   :2                         
##       ATM4          
##  Min.   :    1.563  
##  1st Qu.:  124.334  
##  Median :  403.839  
##  Mean   :  474.043  
##  3rd Qu.:  704.507  
##  Max.   :10919.762  
## 
  • ATM1: Contains 3 NA values. A seasonal pattern with no trend is detected, where the time series is affected by the day of the week. The distribution of the cash values is left-skewed.

  • ATM2: Contains 2 NA values. A seasonal pattern with no trend is detected, where the time series is affected by the day of the week. The distribution of the cash values is left-skewed.

  • ATM3: There are only three dates with non-zero amounts, 04/28/2010, 04/29/2010, and 04/30/2010; all other dates have cash values of 0. Conducting a forecast for ATM3 will not produce much value due to the lack of data.

  • ATM4: ATM4 shows seasonality with no trend. The distribution of the cash values is right-skewed. For the purpose of this project, the data will be rounded to remove the decimals suggesting odd dollar and coin amounts.

atm$ATM4 <- round(atm$ATM4, 0)
head(atm)
## # A tsibble: 6 x 5 [1D]
##   date        ATM1  ATM2  ATM3  ATM4
##   <date>     <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01    96   107     0   777
## 2 2009-05-02    82    89     0   524
## 3 2009-05-03    85    90     0   793
## 4 2009-05-04    90    55     0   908
## 5 2009-05-05    99    79     0    53
## 6 2009-05-06    88    19     0    52

Missing Values and Outliers

Using the na.interp() function from the forecast library, the missing values for ATM1 and ATM2 are filled in through interpolation.

#generate estimates for NAs and replace with new values
atm$ATM1 <- na.interp(atm$ATM1)
atm$ATM2 <-  na.interp(atm$ATM2)

Next, using the tsoutliers() function from the forecast library, we identify and replace the outlier skewing the data for ATM4.

#ATM4
#identify the outlier index and generate a replacement 
tsoutliers(atm$ATM4)
## $index
## [1] 285
## 
## $replacements
## [1] 230
#do the replacement
atm$ATM4[285] <- tsoutliers(atm$ATM4)$replacements

As skews were detected for ATMs 1 and 2, we also use tsoutliers() to detect any existing outliers and replace them accordingly.

tsoutliers(atm$ATM1)
## $index
## integer(0)
## 
## $replacements
## numeric(0)
tsoutliers(atm$ATM2)
## $index
## integer(0)
## 
## $replacements
## numeric(0)

Let’s check whether any NA values remain in ATMs 1 and 2.

#verify how many NA values
apply(is.na(atm),2,sum)
## date ATM1 ATM2 ATM3 ATM4 
##    0    0    0    0    0
summary(atm)
##       date                 ATM1             ATM2             ATM3        
##  Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000  
##  1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 26.00   1st Qu.: 0.0000  
##  Median :2009-10-30   Median : 91.00   Median : 67.00   Median : 0.0000  
##  Mean   :2009-10-30   Mean   : 84.15   Mean   : 62.59   Mean   : 0.7206  
##  3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000  
##  Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :96.0000  
##       ATM4       
##  Min.   :   2.0  
##  1st Qu.: 124.0  
##  Median : 403.0  
##  Mean   : 444.7  
##  3rd Qu.: 704.0  
##  Max.   :1712.0

The ATM4 series still has a high right skew, which is an indicator that a transformation may be required.
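
As a quick check, the Box-Cox lambda suggested by the Guerrero method (from feasts, loaded via fpp3) can be estimated for ATM4 on the wide tsibble built above; this is only a sketch, and the model section below repeats the transformation on the per-ATM series.

#estimate a Box-Cox lambda for ATM4 with the Guerrero method
atm %>% 
  features(ATM4, features = guerrero)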

Forecast

ATM_1 <- atm %>% 
  select(date, ATM1)
ATM_1 %>%
  model(
    STL(ATM1 ~ trend(window = 7) +
                   season(window = "periodic"),
    robust = TRUE)) %>%
  components() %>%
  autoplot()

ACF

ATM_1 %>% 
  ACF(ATM1, lag_max = 28) %>% 
  autoplot()

  #gg_tsdisplay(plot_type='partial', lag_max = 28)

The ACF plot suggests significant lags at 2, 5, and 7, with 7 being the most significant lag. The ACF also appears to decrease relatively slowly, which could be an indicator that the data is non-stationary and could potentially benefit from differencing, but additional checks are required.

Following from the above, and in preparation for an ARIMA model, ndiffs() is used to determine whether any differencing is required.

ndiffs(ATM_1$ATM1)
## [1] 0

Based on the output, no first differencing is required.
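
Since the ACF points to a weekly pattern, a complementary check for seasonal differencing can be run with nsdiffs() from the forecast package; this is a sketch that assumes a seasonal period of 7 days.

#check whether seasonal differencing is suggested at a weekly period of 7
nsdiffs(ts(as.numeric(ATM_1$ATM1), frequency = 7))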

Model

Alternative models for this series are ETS and ARIMA models. An additional model that has been included is the auto ARIMA, in which R automatically searches for the model with the best parameter values.
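
For illustration, the sketch below fits an ETS model and an automatically searched ARIMA side by side on the ATM1 series using fable (with the ATM_1 tsibble built above); glance() then reports the information criteria for each fitted model.

#fit ETS and an automatic ARIMA on ATM1 and view their information criteria
atm1_models <- ATM_1 %>%
  model(
    ets   = ETS(ATM1),
    arima = ARIMA(ATM1, stepwise = FALSE)
  )
glance(atm1_models)

#forecast the 31 days of May 2010 from both models
atm1_models %>%
  forecast(h = 31) %>%
  autoplot(ATM_1)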

#per-ATM Cash series from the keyed tsibble
atm4_out = atm_ts %>% filter(ATM=='ATM4') %>% select(Cash)
atm3_out = atm_ts %>% filter(ATM=='ATM3') %>% select(Cash)
atm2_out = atm_ts %>% filter(ATM=='ATM2') %>% select(Cash)
atm1_out = atm_ts %>% filter(ATM=='ATM1') %>% select(Cash)

#replace the ATM4 values flagged as outliers by tsoutliers()
atm4_out$Cash[tsoutliers(atm4_out$Cash)$index] = tsoutliers(atm4_out$Cash)$replacements

#drop rows with missing Cash and confirm nothing is missing
clean_atm_ts = atm_ts %>% drop_na()
plot_missing(clean_atm_ts)

clean_atm_ts %>% filter(ATM=='ATM1') %>% autoplot(.vars=Cash)

clean_atm_ts %>% filter(ATM=='ATM2') %>% autoplot(.vars=Cash)

clean_atm_ts %>% filter(ATM=='ATM3') %>% autoplot(.vars=Cash)

clean_atm_ts %>% filter(ATM=='ATM4') %>% autoplot(.vars=Cash)

atm_ts %>% filter(ATM=='ATM1') %>% ACF() %>% autoplot(.vars=Cash)

atm_ts %>% filter(ATM=='ATM2') %>% ACF() %>% autoplot(.vars=Cash)

atm_ts %>% filter(ATM=='ATM3') %>% ACF() %>% autoplot(.vars=Cash)

atm_ts %>% filter(ATM=='ATM4') %>% ACF() %>% autoplot(.vars=Cash)

#for each ATM below: Guerrero-selected Box-Cox lambda, a lag-31 difference of the transformed series, and its ACF
atm4_lambda = atm4_out %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm4_trans = atm4_out %>% ACF(difference(box_cox(Cash,atm4_lambda),31)) %>% autoplot()

atm4_trans_data = atm4_out %>% mutate(Cash = difference(box_cox(Cash,atm4_lambda),31))

atm4_trans

atm3_lambda = atm3_out %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm3_trans = atm3_out %>% ACF(difference(box_cox(Cash,atm3_lambda),31)) %>% autoplot()

atm3_trans_data = atm3_out %>% mutate(Cash = difference(box_cox(Cash,atm3_lambda),31))

atm3_trans

atm2_lambda = atm2_out %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm2_trans = atm2_out %>% ACF(difference(box_cox(Cash,atm2_lambda),31)) %>% autoplot()

atm2_trans_data = atm2_out %>% mutate(Cash = difference(box_cox(Cash,atm2_lambda),31))

atm2_trans

atm1_lambda = atm1_out %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm1_trans = atm1_out %>% ACF(difference(box_cox(Cash,atm1_lambda),31)) %>% autoplot()

atm1_trans_data = atm1_out %>% mutate(Cash = difference(box_cox(Cash,atm1_lambda),31))

atm1_trans

#drop the NA rows introduced by differencing (and any remaining missing values)
atm1_trans = atm1_trans_data %>% drop_na() 
atm2_trans = atm2_trans_data %>% drop_na()
atm3_trans = atm3_trans_data %>% drop_na()
atm4_trans = atm4_trans_data %>% drop_na()


#automatic ARIMA search (stepwise disabled) for each transformed series
atmfit_arima_seach1 = atm1_trans_data %>% model(search = ARIMA(Cash,stepwise=FALSE))

atmfit_arima_seach2 = atm2_trans_data %>% model(search = ARIMA(Cash,stepwise=FALSE))

atmfit_arima_seach3 = atm3_trans_data %>% model(search = ARIMA(Cash,stepwise=FALSE))

atmfit_arima_seach4 = atm4_trans_data %>% model(search = ARIMA(Cash,stepwise=FALSE))

View reports of the models:

report(atmfit_arima_seach1)
## Series: Cash 
## Model: ARIMA(0,0,0)(2,1,0)[7] 
## 
## Coefficients:
##          sar1     sar2
##       -0.5846  -0.2379
## s.e.   0.0557   0.0558
## 
## sigma^2 estimated as 2.743:  log likelihood=-636.12
## AIC=1278.24   AICc=1278.31   BIC=1289.88
report(atmfit_arima_seach2)
## Series: Cash 
## Model: ARIMA(5,0,0)(1,1,0)[7] 
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ar5     sar1
##       0.0307  -0.1299  -0.0581  0.0208  -0.1488  -0.4586
## s.e.  0.0552   0.0556   0.0556  0.0553   0.0555   0.0511
## 
## sigma^2 estimated as 128.1:  log likelihood=-1257.68
## AIC=2529.36   AICc=2529.68   BIC=2556.52
report(atmfit_arima_seach3)
## Series: Cash 
## Model: ARIMA(2,0,0) w/ mean 
## 
## Coefficients:
##          ar1     ar2  constant
##       0.7955  0.1624    0.9352
## s.e.  0.0541  0.0615    1.2004
## 
## sigma^2 estimated as 383.4:  log likelihood=-1482.05
## AIC=2972.1   AICc=2972.21   BIC=2987.7
report(atmfit_arima_seach4)
## Series: Cash 
## Model: ARIMA(0,0,0)(2,0,0)[7] 
## 
## Coefficients:
##         sar1    sar2
##       0.2068  0.2231
## s.e.  0.0538  0.0546
## 
## sigma^2 estimated as 346.6:  log likelihood=-1465.07
## AIC=2936.14   AICc=2936.21   BIC=2947.84

Visualize the residuals:

atmfit_arima_seach1 %>% gg_tsresiduals()

atmfit_arima_seach2 %>% gg_tsresiduals()

atmfit_arima_seach3 %>% gg_tsresiduals()

atmfit_arima_seach4 %>% gg_tsresiduals()

glance(atmfit_arima_seach1)
## # A tibble: 1 × 8
##   .model sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots 
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>   
## 1 search   2.74   -636. 1278. 1278. 1290. <cpl [14]> <cpl [0]>
glance(atmfit_arima_seach2)
## # A tibble: 1 × 8
##   .model sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots 
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>   
## 1 search   128.  -1258. 2529. 2530. 2557. <cpl [12]> <cpl [0]>
glance(atmfit_arima_seach3)
## # A tibble: 1 × 8
##   .model sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
## 1 search   383.  -1482. 2972. 2972. 2988. <cpl [2]> <cpl [0]>
glance(atmfit_arima_seach4)
## # A tibble: 1 × 8
##   .model sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots 
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>   
## 1 search   347.  -1465. 2936. 2936. 2948. <cpl [14]> <cpl [0]>

Forecasts

#fit an auto ARIMA to each transformed series and forecast the 31 days of May 2010
atm_arima_model1 = auto.arima(atm1_trans, lambda = "auto", seasonal = TRUE)
forcast1 = forecast::forecast(atm_arima_model1, h=31)
autoplot(forcast1)

atm_arima_model2 = auto.arima(atm2_trans, lambda = "auto", seasonal = TRUE)
forcast2 = forecast::forecast(atm_arima_model2, h=31)
autoplot(forcast2)

atm_arima_model3 = auto.arima(atm3_trans, lambda = "auto", seasonal = TRUE)
forcast3 = forecast::forecast(atm_arima_model3, h=31)
autoplot(forcast3)

atm_arima_model4 = auto.arima(atm4_trans, lambda = "auto", seasonal = TRUE)
forcast4 = forecast::forecast(atm_arima_model4, h=31)
autoplot(forcast4)
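
To produce the Excel-readable file the assignment asks for, the point forecasts and intervals for each ATM can be written to their own sheets; this is a sketch in which writexl is an assumed extra dependency and the file name is arbitrary (a CSV written with readr would also be Excel readable).

#write each ATM's forecast to its own sheet of an Excel workbook (writexl assumed)
library(writexl)

write_xlsx(
  list(
    ATM1 = as.data.frame(forcast1),
    ATM2 = as.data.frame(forcast2),
    ATM3 = as.data.frame(forcast3),
    ATM4 = as.data.frame(forcast4)
  ),
  "ATM624_Forecasts.xlsx"
)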

Part B – Forecasting Power

Load Data

power_raw <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx")
str(power_raw)
## tibble [192 × 3] (S3: tbl_df/tbl/data.frame)
##  $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY-MMM    : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
##  $ KWH         : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...

The YYYY-MMM column will need to be converted to a date type in order to be used as the index of the required tsibble object.

#check for rows containing NA values
power_raw[!complete.cases(power_raw), ]
## # A tibble: 1 × 3
##   CaseSequence `YYYY-MMM`   KWH
##          <dbl> <chr>      <dbl>
## 1          861 2008-Sep      NA

Exploratory Data Analysis

#rename the columns, keep only DATE and KWH, and parse DATE into a yearmonth index
colnames(power_raw) = c('CS','DATE','KWH')
power_raw = power_raw %>% select(DATE,KWH)
power_raw$DATE = yearmonth(as.Date(as.yearmon(power_raw$DATE, "%Y-%b")))
power_raw$KWH = as.numeric(power_raw$KWH)

power_ts = as_tsibble(tibble(power_raw),index='DATE')

Plots:

plot_str(power_ts)
plot_density(power_ts)

plot_missing(power_ts)

plot_histogram(power_ts)

power_ts %>% ACF() %>% autoplot()

power_ts %>% autoplot(.vars=KWH)

#replace the KWH values flagged as outliers by tsoutliers(), treating the series as monthly (frequency = 12)
power_ts$KWH[tsoutliers(ts(power_ts$KWH, frequency=12))$index] = tsoutliers(ts(power_ts$KWH, frequency=12))$replacements

Model

power_lambda = power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

power_trans = power_ts %>% ACF(difference(box_cox(KWH,power_lambda),12)) %>% autoplot()

power_trans_data = power_ts %>% mutate(KWH = difference(box_cox(KWH,power_lambda),12))

power_trans

power_trans_data %>% autoplot()

plot_density(power_trans_data)

t1_p = power_trans_data %>% drop_na()

adf.test(t1_p$KWH)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag   ADF p.value
## [1,]   0 -9.82    0.01
## [2,]   1 -7.86    0.01
## [3,]   2 -5.80    0.01
## [4,]   3 -6.19    0.01
## [5,]   4 -5.13    0.01
## Type 2: with drift no trend 
##      lag   ADF p.value
## [1,]   0 -9.88    0.01
## [2,]   1 -7.93    0.01
## [3,]   2 -5.88    0.01
## [4,]   3 -6.29    0.01
## [5,]   4 -5.24    0.01
## Type 3: with drift and trend 
##      lag   ADF p.value
## [1,]   0 -9.90    0.01
## [2,]   1 -7.96    0.01
## [3,]   2 -5.89    0.01
## [4,]   3 -6.33    0.01
## [5,]   4 -5.26    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01
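
The ADF statistics reject the null hypothesis of a unit root at every lag under all three specifications, so the transformed series can be treated as stationary. As a cross-check, the KPSS test from feasts (whose null hypothesis is stationarity) can be applied; a minimal sketch:

#KPSS cross-check on the transformed series (null hypothesis: stationarity)
t1_p %>% 
  features(KWH, unitroot_kpss)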

Model Creation

Report

power_arima_search = power_trans_data %>% model(search = ARIMA(KWH,stepwise=FALSE))
report(power_arima_search)
## Series: KWH 
## Model: ARIMA(4,0,0)(2,0,0)[12] w/ mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     sar1     sar2  constant
##       0.2556  -0.0062  0.2442  -0.1481  -0.7166  -0.4151    0.0023
## s.e.  0.0750   0.0750  0.0742   0.0742   0.0739   0.0747    0.0008
## 
## sigma^2 estimated as 9.298e-05:  log likelihood=565.58
## AIC=-1115.16   AICc=-1114.38   BIC=-1089.1
glance(power_arima_search)
## # A tibble: 1 × 8
##   .model    sigma2 log_lik    AIC   AICc    BIC ar_roots   ma_roots 
##   <chr>      <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <list>     <list>   
## 1 search 0.0000930    566. -1115. -1114. -1089. <cpl [28]> <cpl [0]>

Residuals

power_arima_search %>% gg_tsresiduals()

Forecast

power_arima_search = auto.arima(power_trans_data, lambda = "auto", stepwise=FALSE, seasonal = TRUE) 

power_forcast = forecast::forecast(power_arima_search, h=12)
autoplot(power_forcast)
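
As with Part A, the monthly forecast can be exported to an Excel-readable file; a sketch using writexl (an assumed dependency, with an arbitrary file name):

#export the 12-month 2014 forecast (point forecasts and intervals) for submission
writexl::write_xlsx(as.data.frame(power_forcast), "ResidentialPower_Forecast_2014.xlsx")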