Project 1

Project 1 Description

This project consists of three parts (two required and one bonus) and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, Apr 11. I will accept late submissions with a penalty until the meetup after that, when we review some projects.

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link, along with the actual .rmd file. Also, please submit the forecast, which you will put in an Excel-readable file.

Part A

Load Data

df_atm <- read.csv("C:/Users/SeungminSong/Downloads/624R/ATM624Data.csv")

head(df_atm)
##                   DATE  ATM Cash
## 1 5/1/2009 12:00:00 AM ATM1   96
## 2 5/1/2009 12:00:00 AM ATM2  107
## 3 5/2/2009 12:00:00 AM ATM1   82
## 4 5/2/2009 12:00:00 AM ATM2   89
## 5 5/3/2009 12:00:00 AM ATM1   85
## 6 5/3/2009 12:00:00 AM ATM2   90

Missing Value

Missing values make it difficult to convert the data to a time series, so they are deleted. Moreover, given how volatile daily ATM withdrawals are, deleting the small number of missing observations is more practical here than imputing them.

sum(is.na(df_atm))
## [1] 19
df_atm <- na.omit(df_atm)
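An alternative to deleting the rows is imputation. As a minimal sketch (assuming the dplyr and forecast packages), forecast::na.interp could fill the missing Cash values per machine when applied to the raw data before na.omit(); rows whose ATM label is itself missing cannot be assigned to a series and are dropped:

library(dplyr)
library(forecast)

# Sketch: impute missing Cash per ATM instead of deleting rows.
df_atm_imputed <- df_atm %>%
  filter(!is.na(ATM)) %>%
  group_by(ATM) %>%
  mutate(Cash = as.numeric(na.interp(Cash))) %>%  # interpolates NAs
  ungroup()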

Time Series Transformation

df_atm$DATE <- as.POSIXct(df_atm$DATE, format="%m/%d/%Y %I:%M:%S %p")
# Define as tsibble with DATE as index
#atm_data_ts <-df_atm %>%
#  as_tsibble(index = DATE, key = ATM)
#atm_data_ts 

ATM 1

The ATM1 series appears to be stationary. The Augmented Dickey-Fuller (ADF) test can determine whether a time series is stationary by checking for a unit root; in a stationary series, the mean, variance, and autocovariance remain constant over time.

# Filter to ATM1 data
atm1_data <- df_atm %>%
  filter(ATM == 'ATM1')

nrow(atm1_data)
## [1] 362
head(atm1_data)
##         DATE  ATM Cash
## 1 2009-05-01 ATM1   96
## 2 2009-05-02 ATM1   82
## 3 2009-05-03 ATM1   85
## 4 2009-05-04 ATM1   90
## 5 2009-05-05 ATM1   99
## 6 2009-05-06 ATM1   88
atm1_data_ts <- atm1_data %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%Y %I:%M:%S %p")) %>%
  as_tsibble(index = DATE)

Plot

plot(atm1_data_ts$DATE, atm1_data_ts$Cash, type="l",
     main="Cash (in hundreds $USD)")

Augmented Dickey-Fuller (ADF)

Dickey-Fuller value:

  • The test statistic is -4.6043. A unit root implies non-stationarity, and the more negative the statistic, the stronger the evidence against a unit root, i.e., the more likely the series is stationary.

Lag order:

  • The lag order is 7, the number of lags included when the ADF statistic is computed. It is chosen from the data to improve the accuracy of the test.

p-value:

  • 0.01, which is very low. There is sufficient evidence to reject the null hypothesis that the series has a unit root (i.e., is non-stationary). Generally, the null is rejected when the p-value is below 0.05 (5%); here it is rejected even at the 1% level, so there is very strong evidence that the series is stationary.

The ATM1 series therefore maintains a roughly constant mean and variance over time and can be interpreted as stationary. Because the data are stationary, no transformation is necessary.

atm1_ts <- ts(atm1_data_ts$Cash, start=c(2009,5), frequency=7)

adf_test_result <- adf.test(atm1_ts, alternative = "stationary")

print(adf_test_result)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  atm1_ts
## Dickey-Fuller = -4.6043, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Moving Average

The 7-day moving average stays within a stable range, suggesting no strong trend in the series.

atm1_data_ts <- atm1_data_ts %>%
  mutate(Moving_Average = rollapply(Cash, 7, mean, partial = TRUE, align = 'right'))

#print(atm1_data_ts, n = Inf)

STL

Because only one year of data is available, a frequency of 365 cannot be used (a full annual cycle never repeats). The weekly pattern is therefore examined with frequency 7.

Overall, the decomposition shows a weekly seasonal component and a relatively stable trend with some rises and falls. No major outliers stand out in the remainder panel, but a statistical outlier-detection step is applied to the remainder below to confirm this.

atm1_ts <- ts(atm1_data_ts$Cash, start = c(2009, 5), frequency=7)

atm1_decomposed <- stl(atm1_ts, s.window="periodic")

plot(atm1_decomposed)

Outlier

The STL remainder panel shows no major outliers, but examining the remainder separately does flag some points. Because daily ATM withdrawals can legitimately swing up and down, these values were kept rather than deleted or replaced.

remainder_data <- atm1_decomposed$time.series[, "remainder"]

boxplot(remainder_data, main="Boxplot for Remainder Data")

outliers <- tsoutliers(remainder_data)

print(outliers)
## $index
## [1] 47
## 
## $replacements
## [1] 45.05059
outlier_values_atm1_data_ts <- boxplot.stats(atm1_data_ts$Cash)$out

#par(mfrow=c(2,2)) 

boxplot(atm1_data_ts$Cash, main = "ATM1", boxwex = 0.5)
mtext(paste("Outliers: ", paste(outlier_values_atm1_data_ts, collapse = ", ")), cex = 0.6)

ACF & PACF

Given that the ACF decays gradually as the lag increases and the PACF shows no dominant spikes, the series may be better suited to a mixed ARIMA model than to a pure AR or MA model.

par(mfrow = c(2, 1))

acf(atm1_ts, main="ACF for ATM1")

pacf(atm1_ts, main="PACF for ATM1")

ATM1 Model

The ARIMA(0,1,1) model's AIC and BIC are lower than those of ARIMA(0,0,1) (both fit with drift below), so ARIMA(0,1,1) was used for the forecast.

Set seasonal = TRUE so auto.arima can consider seasonal terms.

arima <- auto.arima(atm1_ts, seasonal = TRUE) 
summary(arima)
## Series: atm1_ts 
## ARIMA(0,0,1)(0,1,1)[7] 
## 
## Coefficients:
##          ma1     sma1
##       0.1674  -0.5813
## s.e.  0.0565   0.0472
## 
## sigma^2 = 666.6:  log likelihood = -1658.32
## AIC=3322.63   AICc=3322.7   BIC=3334.25
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.09700089 25.49555 16.17552 -106.2792 122.4165 0.8594986
##                     ACF1
## Training set -0.01116544
checkresiduals(arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 11.267, df = 12, p-value = 0.5062
## 
## Model df: 2.   Total lags used: 14

A slight trend is detected, so include.drift = TRUE is set.

model <- Arima(atm1_ts, order = c(0,1,1), include.drift = TRUE)
model
## Series: atm1_ts 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##           ma1    drift
##       -1.0000  -0.0251
## s.e.   0.0101   0.0184
## 
## sigma^2 = 1344:  log likelihood = -1814.42
## AIC=3634.84   AICc=3634.91   BIC=3646.51
checkresiduals(model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 353.17, df = 13, p-value < 2.2e-16
## 
## Model df: 1.   Total lags used: 14
model7 <- Arima(atm1_ts, order = c(0,0,1), include.drift = TRUE)
model7
## Series: atm1_ts 
## ARIMA(0,0,1) with drift 
## 
## Coefficients:
##          ma1  intercept    drift
##       0.0993    88.4623  -0.0252
## s.e.  0.0668     4.2126   0.0201
## 
## sigma^2 = 1336:  log likelihood = -1814.92
## AIC=3637.85   AICc=3637.96   BIC=3653.42
checkresiduals(model7)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1) with drift
## Q* = 349.77, df = 13, p-value < 2.2e-16
## 
## Model df: 1.   Total lags used: 14
fore<- forecast(model, h = 30)
plot(fore)

Forecast

The median lower bound of the 95% prediction interval is approximately 7.00, and the median upper bound is approximately 150.92 (hundreds of dollars).
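These medians can be read directly off the forecast object; a minimal sketch (the 95% bounds are the second column of fore$lower and fore$upper):

# Median 95% prediction-interval bounds for the ATM1 forecast.
median(fore$lower[, "95%"])  # about 7.0 (hundreds of dollars)
median(fore$upper[, "95%"])  # about 150.9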

fore
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2061.286       79.32234 32.27145 126.3732 7.364218 151.2805
## 2061.429       79.29719 32.24630 126.3481 7.339069 151.2553
## 2061.571       79.27204 32.22116 126.3229 7.313921 151.2302
## 2061.714       79.24689 32.19601 126.2978 7.288773 151.2050
## 2061.857       79.22175 32.17086 126.2726 7.263625 151.1799
## 2062.000       79.19660 32.14571 126.2475 7.238477 151.1547
## 2062.143       79.17145 32.12056 126.2223 7.213328 151.1296
## 2062.286       79.14630 32.09541 126.1972 7.188180 151.1044
## 2062.429       79.12115 32.07027 126.1720 7.163032 151.0793
## 2062.571       79.09600 32.04512 126.1469 7.137884 151.0541
## 2062.714       79.07086 32.01997 126.1217 7.112735 151.0290
## 2062.857       79.04571 31.99482 126.0966 7.087587 151.0038
## 2063.000       79.02056 31.96967 126.0714 7.062439 150.9787
## 2063.143       78.99541 31.94453 126.0463 7.037291 150.9535
## 2063.286       78.97026 31.91938 126.0212 7.012143 150.9284
## 2063.429       78.94512 31.89423 125.9960 6.986994 150.9032
## 2063.571       78.91997 31.86908 125.9709 6.961846 150.8781
## 2063.714       78.89482 31.84393 125.9457 6.936698 150.8529
## 2063.857       78.86967 31.81878 125.9206 6.911550 150.8278
## 2064.000       78.84452 31.79364 125.8954 6.886401 150.8026
## 2064.143       78.81937 31.76849 125.8703 6.861253 150.7775
## 2064.286       78.79423 31.74334 125.8451 6.836105 150.7523
## 2064.429       78.76908 31.71819 125.8200 6.810957 150.7272
## 2064.571       78.74393 31.69304 125.7948 6.785809 150.7021
## 2064.714       78.71878 31.66790 125.7697 6.760660 150.6769
## 2064.857       78.69363 31.64275 125.7445 6.735512 150.6518
## 2065.000       78.66849 31.61760 125.7194 6.710364 150.6266
## 2065.143       78.64334 31.59245 125.6942 6.685216 150.6015
## 2065.286       78.61819 31.56730 125.6691 6.660068 150.5763
## 2065.429       78.59304 31.54215 125.6439 6.634919 150.5512

ATM 2

# Filter to ATM2 data
atm2_data <- df_atm %>%
  filter(ATM == 'ATM2')

nrow(atm2_data)
## [1] 363
head(atm2_data)
##         DATE  ATM Cash
## 1 2009-05-01 ATM2  107
## 2 2009-05-02 ATM2   89
## 3 2009-05-03 ATM2   90
## 4 2009-05-04 ATM2   55
## 5 2009-05-05 ATM2   79
## 6 2009-05-06 ATM2   19
atm2_data_ts <- atm2_data %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%Y %I:%M:%S %p")) %>%
  as_tsibble(index = DATE)

Plot

plot(atm2_data_ts$DATE, atm2_data_ts$Cash, type="l",
     main="Cash (in hundreds $USD)")

Augmented Dickey-Fuller (ADF)

Dickey-Fuller value:

  • The test statistic is -6.009. A unit root implies non-stationarity, and the more negative the statistic, the stronger the evidence against a unit root, i.e., the more likely the series is stationary.

Lag order:

  • The lag order is 7, the number of lags included when the ADF statistic is computed. It is chosen from the data to improve the accuracy of the test.

p-value:

  • 0.01, which is very low. There is sufficient evidence to reject the null hypothesis that the series has a unit root (i.e., is non-stationary). Generally, the null is rejected when the p-value is below 0.05 (5%); here it is rejected even at the 1% level, so there is very strong evidence that the series is stationary.

The ATM2 series therefore maintains a roughly constant mean and variance over time and can be interpreted as stationary. Because the data are stationary, no transformation is necessary.

atm2_ts <- ts(atm2_data_ts$Cash, start=c(2009,5), frequency=7)

adf_test_result2 <- adf.test(atm2_ts, alternative = "stationary")

print(adf_test_result2)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  atm2_ts
## Dickey-Fuller = -6.009, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Moving Average

The 7-day moving average stays within a stable range, suggesting no strong trend in the series.

atm2_data_ts <- atm2_data_ts %>%
  mutate(Moving_Average = rollapply(Cash, 7, mean, partial = TRUE, align = 'right'))

#print(atm2_data_ts, n = Inf)

STL

Because only one year of data is available, a frequency of 365 cannot be used (a full annual cycle never repeats). The weekly pattern is therefore examined with frequency 7.

Overall, the decomposition shows a weekly seasonal component and a relatively stable trend with some rises and falls. No major outliers stand out in the remainder panel, but a statistical outlier-detection step is applied to the remainder below to confirm this.

atm2_ts <- ts(atm2_data_ts$Cash, start = c(2009, 5), frequency=7)

atm2_decomposed <- stl(atm2_ts, s.window="periodic")

plot(atm2_decomposed)

Outlier

The STL remainder panel shows no major outliers, but examining the remainder separately does flag some points. Because daily ATM withdrawals can legitimately swing up and down, these values were kept rather than deleted or replaced.

remainder_data <- atm2_decomposed$time.series[, "remainder"]

boxplot(remainder_data, main="Boxplot for Remainder Data")

outliers <- tsoutliers(remainder_data)

print(outliers)
## $index
##  [1]   5  12  19  26  33  36  39  40  47  49  55  57  62  69  70 132 176 212 241
## [20] 251 293
## 
## $replacements
##  [1] -28.4396017 -28.8897437 -22.6264346 -38.1987865 -18.3224833  30.4286118
##  [7] -17.6371514 -49.8802125 -49.6961975 -52.5743098 -14.8278293   4.8023732
## [13]  -4.0286966  10.3716430   0.7832172  44.5067303   7.9306150  33.2011885
## [19]  13.9146149  39.4783457  34.8244937
outlier_values_atm2_data_ts <- boxplot.stats(atm2_data_ts$Cash)$out

#par(mfrow=c(2,2)) 

boxplot(atm2_data_ts$Cash, main = "ATM2", boxwex = 0.5)
mtext(paste("Outliers: ", paste(outlier_values_atm2_data_ts, collapse = ", ")), cex = 0.6)

ACF & PACF

Given that the ACF decays gradually as the lag increases and the PACF shows no dominant spikes, the series may be better suited to a mixed ARIMA model than to a pure AR or MA model.

par(mfrow = c(2, 1))

# Plotting the ACF for ATM2 daily cash withdrawals
acf(atm2_ts, main="ACF for ATM2")

# Plotting the PACF for ATM2 daily cash withdrawals
pacf(atm2_ts, main="PACF for ATM2")

ATM2 Model

With seasonal = TRUE, auto.arima selects ARIMA(2,0,2)(0,1,2)[7] for ATM2.

# Since seasonality appears, seasonal = TRUE
arima2 <- auto.arima(atm2_ts, seasonal = TRUE) 
summary(arima2)
## Series: atm2_ts 
## ARIMA(2,0,2)(0,1,2)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1    sma2
##       -0.4148  -0.8922  0.4008  0.7534  -0.7220  0.0801
## s.e.   0.0674   0.0510  0.1027  0.0688   0.0567  0.0550
## 
## sigma^2 = 708.1:  log likelihood = -1671.98
## AIC=3357.96   AICc=3358.28   BIC=3385.08
## 
## Training set error measures:
##                      ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set -0.4583759 26.12942 18.09675 -Inf  Inf 0.8421495 0.009277035
checkresiduals(arima2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,2)[7]
## Q* = 12.12, df = 8, p-value = 0.1459
## 
## Model df: 6.   Total lags used: 14
  • Generally, the lower the AIC and BIC and the higher the log likelihood, the better the model fit. Comparing the two non-seasonal candidates fit below, the ARIMA(0,1,2) model has higher AIC and BIC and a lower log likelihood, so the ARIMA(2,0,2) model is judged the better fit (see the comparison sketch after the fits).

  • The ARIMA(2,0,2) residuals are also slightly more centered and show lower autocorrelation (smaller ACF values).

# A slight trend is detected, so include.drift = TRUE
model2 <- Arima(atm2_ts, order = c(2,0,2), include.drift = TRUE)
model2
## Series: atm2_ts 
## ARIMA(2,0,2) with drift 
## 
## Coefficients:
##          ar1      ar2      ma1      ma2  intercept    drift
##       0.1435  -0.1639  -0.2071  -0.2652    74.8231  -0.0682
## s.e.  0.1134   0.0957   0.1083   0.0889     1.9165   0.0091
## 
## sigma^2 = 1234:  log likelihood = -1804.17
## AIC=3622.34   AICc=3622.65   BIC=3649.6
checkresiduals(model2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2) with drift
## Q* = 226.42, df = 10, p-value < 2.2e-16
## 
## Model df: 4.   Total lags used: 14
fore2<- forecast(model2, h = 30)
plot(fore2)

model3 <- Arima(atm2_ts, order = c(0,1,2), include.drift = TRUE)
model3
## Series: atm2_ts 
## ARIMA(0,1,2) with drift 
## 
## Coefficients:
##           ma1     ma2    drift
##       -1.0782  0.0782  -0.0671
## s.e.   0.1059  0.1055   0.0177
## 
## sigma^2 = 1473:  log likelihood = -1835.57
## AIC=3679.15   AICc=3679.26   BIC=3694.71
checkresiduals(model3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 493.19, df = 12, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 14
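For reference, the information criteria of the three ATM2 fits can be collected side by side. A minimal sketch, using the aic, bic, and loglik elements stored on each fitted object:

# Side-by-side information criteria for the candidate ATM2 models.
data.frame(
  model  = c("ARIMA(2,0,2)(0,1,2)[7]", "ARIMA(2,0,2)+drift", "ARIMA(0,1,2)+drift"),
  AIC    = c(arima2$aic, model2$aic, model3$aic),
  BIC    = c(arima2$bic, model2$bic, model3$bic),
  logLik = c(arima2$loglik, model2$loglik, model3$loglik)
)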

Forecast

ARIMA(2,0,2) Model Forecast

The median lower bound of the 95% prediction interval is approximately -27.36, and the median upper bound is approximately 123.76. The negative lower bounds are an artifact of the unconstrained Gaussian prediction intervals; since withdrawals cannot be negative, the practical range is roughly 0–124 (hundreds of dollars).
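Because withdrawals cannot be negative, the lower bounds can be truncated at zero for reporting; a one-line sketch:

# Truncate negative 95% lower bounds at zero (cash cannot be negative).
fore2_lo95 <- pmax(fore2$lower[, "95%"], 0)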

fore2
##          Point Forecast         Lo 80     Hi 80     Lo 95     Hi 95
## 2061.429       30.43631 -14.581220835  75.45384 -38.41206  99.28468
## 2061.571       24.55643 -20.551923674  69.66479 -44.43085  93.54372
## 2061.714       49.43449   0.201671011  98.66731 -25.86061 124.72959
## 2061.857       53.89972   4.610242351 103.18920 -21.48203 129.28147
## 2062.000       50.39397   1.019625798  99.76831 -25.11757 125.90551
## 2062.143       49.08938  -0.291486690  98.47024 -26.43214 124.61089
## 2062.286       49.40709   0.024918922  98.78926 -26.11642 124.93060
## 2062.429       49.59694   0.214427185  98.97945 -25.92709 125.12097
## 2062.571       49.50256   0.120037693  98.88508 -26.02149 125.02661
## 2062.714       49.38833   0.005801912  98.77087 -26.13573 124.91240
## 2062.857       49.31784  -0.064689703  98.70038 -26.20622 124.84191
## 2063.000       49.25688  -0.125651773  98.63941 -26.26719 124.78095
## 2063.143       49.19012  -0.192412584  98.57265 -26.33395 124.71419
## 2063.286       49.12097  -0.261567645  98.50350 -26.40310 124.64503
## 2063.429       49.05242  -0.330116002  98.43495 -26.47165 124.57648
## 2063.571       48.98435  -0.398184886  98.36688 -26.53972 124.50841
## 2063.714       48.91625  -0.466284372  98.29878 -26.60782 124.44031
## 2063.857       48.84807  -0.534466829  98.23060 -26.67600 124.37213
## 2064.000       48.77988  -0.602656182  98.16241 -26.74419 124.30394
## 2064.143       48.71170  -0.670832926  98.09423 -26.81237 124.23577
## 2064.286       48.64353  -0.739006731  98.02606 -26.88054 124.16759
## 2064.429       48.57535  -0.807182179  97.95788 -26.94872 124.09942
## 2064.571       48.50717  -0.875358346  97.88971 -27.01689 124.03124
## 2064.714       48.43900  -0.943534346  97.82153 -27.08507 123.96306
## 2064.857       48.37082  -1.011710205  97.75336 -27.15324 123.89489
## 2065.000       48.30265  -1.079886071  97.68518 -27.22142 123.82671
## 2065.143       48.23447  -1.148061960  97.61700 -27.28960 123.75854
## 2065.286       48.16629  -1.216237853  97.54883 -27.35777 123.69036
## 2065.429       48.09812  -1.284413741  97.48065 -27.42595 123.62219
## 2065.571       48.02994  -1.352589629  97.41248 -27.49412 123.55401

ATM 3

# Filter to ATM3 data
atm3_data <- df_atm %>%
  filter(ATM == 'ATM3')

nrow(atm3_data)
## [1] 365
head(atm3_data)
##         DATE  ATM Cash
## 1 2009-05-01 ATM3    0
## 2 2009-05-02 ATM3    0
## 3 2009-05-03 ATM3    0
## 4 2009-05-04 ATM3    0
## 5 2009-05-05 ATM3    0
## 6 2009-05-06 ATM3    0
atm3_data_ts <- atm3_data %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%Y %I:%M:%S %p")) %>%
  as_tsibble(index = DATE)
sum(is.na(atm3_data_ts))
## [1] 0

Plot

plot(atm3_data_ts$DATE, atm3_data_ts$Cash, type="l",
     main="Cash (in hundreds $USD)")

Outlier

ATM3 has only three non-zero observations, so the STL, ADF, ACF, and PACF analyses performed for the other machines are not meaningful here.

outlier_values_atm3_data_ts <- boxplot.stats(atm3_data_ts$Cash)$out

boxplot(atm3_data_ts$Cash, main = "ATM3", boxwex = 0.5)
mtext(paste("Outliers (ATM3): ", paste(outlier_values_atm1_data_ts, collapse = ", ")), cex = 0.6)

atm3_data_ts$Cash  <- ifelse(atm3_data_ts$Cash < 1, 
                         NA, atm3_data_ts$Cash )
atm3_data_ts$Cash
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [176] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [201] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [226] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [251] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [276] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [301] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [326] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [351] NA NA NA NA NA NA NA NA NA NA NA NA 96 82 85
sum(is.na(atm3_data_ts$Cash))
## [1] 362
new_atm3_data_ts <- atm3_data_ts %>% filter(!is.na(Cash))
new_atm3_data_ts
## # A tsibble: 3 x 3 [1D]
##   DATE       ATM    Cash
##   <date>     <chr> <int>
## 1 2010-04-28 ATM3     96
## 2 2010-04-29 ATM3     82
## 3 2010-04-30 ATM3     85

ATM3 Model

arima <- auto.arima(atm3_data_ts$Cash) 
arima
## Series: atm3_data_ts$Cash 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##          mean
##       87.6667
## s.e.   3.4749
## 
## sigma^2 = 54.33:  log likelihood = -9.64
## AIC=23.28   AICc=Inf   BIC=21.48

The p-value here is very high at 0.9941. This means that there is not enough evidence to reject the null hypothesis, so it is very likely that the residuals are white noise. Put another way, it means that the model has captured the time series structure of the data well, with little pattern or structural information remaining in the residuals.

model4 <- arima(atm3_data_ts$Cash, order = c(0,0,2))
model4
## 
## Call:
## arima(x = atm3_data_ts$Cash, order = c(0, 0, 2))
## 
## Coefficients:
##           ma1     ma2  intercept
##       -1.6664  1.0000     86.810
## s.e.   1.4564  1.4541      1.154
## 
## sigma^2 estimated as 7.02:  log likelihood = -8.69,  aic = 25.37
tsdiag(model4)

Box.test(model4$residuals, lag=1, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  model4$residuals
## X-squared = 5.4051e-05, df = 1, p-value = 0.9941
fore <- forecast(model4, h=30)

plot(fore, col = "black", lwd=2, flty=3, shadecols = c("gray", "mistyrose"))

Forecast

The median lower bound of the 95% prediction interval is 75.46026, and the median upper bound is 98.15983.

fore
##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 366       90.08179 86.02022 94.14337 83.87015 96.29344
## 367       85.65046 78.69122 92.60969 75.00723 96.29369
## 368       86.81005 79.38882 94.23128 75.46026 98.15983
## 369       86.81005 79.38882 94.23128 75.46026 98.15983
## 370       86.81005 79.38882 94.23128 75.46026 98.15983
## 371       86.81005 79.38882 94.23128 75.46026 98.15983
## 372       86.81005 79.38882 94.23128 75.46026 98.15983
## 373       86.81005 79.38882 94.23128 75.46026 98.15983
## 374       86.81005 79.38882 94.23128 75.46026 98.15983
## 375       86.81005 79.38882 94.23128 75.46026 98.15983
## 376       86.81005 79.38882 94.23128 75.46026 98.15983
## 377       86.81005 79.38882 94.23128 75.46026 98.15983
## 378       86.81005 79.38882 94.23128 75.46026 98.15983
## 379       86.81005 79.38882 94.23128 75.46026 98.15983
## 380       86.81005 79.38882 94.23128 75.46026 98.15983
## 381       86.81005 79.38882 94.23128 75.46026 98.15983
## 382       86.81005 79.38882 94.23128 75.46026 98.15983
## 383       86.81005 79.38882 94.23128 75.46026 98.15983
## 384       86.81005 79.38882 94.23128 75.46026 98.15983
## 385       86.81005 79.38882 94.23128 75.46026 98.15983
## 386       86.81005 79.38882 94.23128 75.46026 98.15983
## 387       86.81005 79.38882 94.23128 75.46026 98.15983
## 388       86.81005 79.38882 94.23128 75.46026 98.15983
## 389       86.81005 79.38882 94.23128 75.46026 98.15983
## 390       86.81005 79.38882 94.23128 75.46026 98.15983
## 391       86.81005 79.38882 94.23128 75.46026 98.15983
## 392       86.81005 79.38882 94.23128 75.46026 98.15983
## 393       86.81005 79.38882 94.23128 75.46026 98.15983
## 394       86.81005 79.38882 94.23128 75.46026 98.15983
## 395       86.81005 79.38882 94.23128 75.46026 98.15983

ATM 4

# Filter to ATM4 data
atm4_data <- df_atm %>%
  filter(ATM == 'ATM4')

nrow(atm4_data)
## [1] 365
head(atm4_data)
##         DATE  ATM Cash
## 1 2009-05-01 ATM4  777
## 2 2009-05-02 ATM4  524
## 3 2009-05-03 ATM4  793
## 4 2009-05-04 ATM4  908
## 5 2009-05-05 ATM4   53
## 6 2009-05-06 ATM4   52
atm4_data_ts <- atm4_data %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%Y %I:%M:%S %p")) %>%
  as_tsibble(index = DATE)

Plot

plot(atm4_data_ts$DATE, atm4_data_ts$Cash, type="l",
     main="Cash (in hundreds $USD)")

Outlier

The extreme value of 10,920 (in hundreds of dollars) lies far outside the rest of the series, so it is treated as an outlier and removed.

outlier_values_atm4_data_ts <- boxplot.stats(atm4_data_ts$Cash)$out

#par(mfrow=c(2,2)) 

boxplot(atm4_data_ts$Cash, main = "ATM4", boxwex = 0.5)
mtext(paste("Outliers: ", paste(outlier_values_atm4_data_ts, collapse = ", ")), cex = 0.6)

Calculate Outliers to Remove

boxplot_stats <- boxplot.stats(atm4_data_ts$Cash)

outliers <- boxplot_stats$out

atm4_data_cleaned <- atm4_data_ts[!atm4_data_ts$Cash %in% outliers, ]

boxplot(atm4_data_cleaned$Cash, main = "ATM4 Cash Transactions without Outliers", boxwex = 0.5)

plot(atm4_data_cleaned$DATE, atm4_data_cleaned$Cash, type="l",
     main="Cash (in hundreds $USD)")

Augmented Dickey-Fuller (ADF)

Dickey-Fuller value:

  • The test statistic is -5.6265. A unit root implies non-stationarity, and the more negative the statistic, the stronger the evidence against a unit root, i.e., the more likely the series is stationary.

Lag order:

  • The lag order is 7, the number of lags included when the ADF statistic is computed. It is chosen from the data to improve the accuracy of the test.

p-value:

  • 0.01, which is very low. There is sufficient evidence to reject the null hypothesis that the series has a unit root (i.e., is non-stationary). Generally, the null is rejected when the p-value is below 0.05 (5%); here it is rejected even at the 1% level, so there is very strong evidence that the series is stationary.

The ATM4 series (with the outlier removed) therefore maintains a roughly constant mean and variance over time and can be interpreted as stationary. Because the data are stationary, no transformation is necessary.

atm4_ts <- ts(atm4_data_cleaned$Cash, start=c(2009,5, as.numeric(format(min(atm4_data_cleaned$DATE), "%j"))), frequency=7)

adf_test_result4 <- adf.test(atm4_ts, alternative = "stationary")

print(adf_test_result4)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  atm4_ts
## Dickey-Fuller = -5.6265, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Moving Average

The 7-day moving average stays within a stable range, suggesting no strong trend in the series.

atm4_data_ts <- atm4_data_ts %>%
  filter(Cash != 10920)

atm4_data_ts <- atm4_data_ts %>%
  mutate(Moving_Average = rollapply(Cash, 7, mean, partial = TRUE, align = 'right'))

#print(atm4_data_ts, n = Inf)

STL

Because only one year of data is available, a frequency of 365 cannot be used (a full annual cycle never repeats). The weekly pattern is therefore examined with frequency 7.

This graph illustrates a time series that exhibits a clear seasonal pattern, as evidenced by the regular fluctuations in the seasonal component. The trend component indicates that the underlying trend of the series is relatively stable, with occasional increases or decreases over time. In the residual panel, there do not appear to be any major outliers present, suggesting that most of the systematic variation has been captured by the trend and seasonal components.

atm4_decomposed <- stl(atm4_ts, s.window="periodic")

plot(atm4_decomposed)

ACF & PACF

The ACF plot shows a gradual decline and the PACF plot doesn’t show significant spikes, suggesting that the ATM4 time series may be best modeled with an ARIMA model rather than a pure AR or MA model.

par(mfrow = c(2, 1))

# Plotting the ACF for ATM4 daily cash withdrawals
acf(atm4_ts, main="ACF for ATM4")

# Plotting the PACF for ATM4 daily cash withdrawals
pacf(atm4_ts, main="PACF for ATM4")

ATM4 Model

With seasonal = TRUE, auto.arima selects ARIMA(2,0,2)(2,0,2)[7] with non-zero mean for ATM4.

arima4 <- auto.arima(atm4_ts, seasonal = TRUE) 
summary(arima4)
## Series: atm4_ts 
## ARIMA(2,0,2)(2,0,2)[7] with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2    sar1    sar2    sma1     sma2
##       1.8520  -0.8824  -1.7913  0.8184  0.0280  0.8230  0.0715  -0.6947
## s.e.  0.1332   0.1288   0.1646  0.1602  0.1661  0.1553  0.1820   0.1555
##           mean
##       437.6635
## s.e.   34.3522
## 
## sigma^2 = 111258:  log likelihood = -2620.37
## AIC=5260.74   AICc=5261.37   BIC=5299.69
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.5806187 329.3931 269.7137 -479.831 511.4238 0.7835461
##                     ACF1
## Training set 0.002777017
checkresiduals(arima4)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(2,0,2)[7] with non-zero mean
## Q* = 21.675, df = 6, p-value = 0.001386
## 
## Model df: 8.   Total lags used: 14
  • Generally, the lower the AIC and BIC and the higher the log likelihood, the better the model fit. By those criteria the seasonal ARIMA(2,0,2)(2,0,2)[7] fits somewhat better than the non-seasonal ARIMA(2,0,2) with drift fit below (AIC 5260.7 vs. 5281.0), though the residuals of both models fail the Ljung-Box test at the 5% level.

  • The ARIMA(2,0,2) with drift is used for the forecast to account for the slight downward trend.

model6 <- Arima(atm4_ts, order = c(2,0,2), include.drift = TRUE)
model6
## Series: atm4_ts 
## ARIMA(2,0,2) with drift 
## 
## Coefficients:
##           ar1     ar2     ma1      ma2  intercept    drift
##       -0.1070  0.5131  0.2000  -0.5334   464.9270  -0.1267
## s.e.   0.5626  0.4117  0.5677   0.4596    40.3717   0.1922
## 
## sigma^2 = 119165:  log likelihood = -2633.49
## AIC=5280.98   AICc=5281.29   BIC=5308.24
checkresiduals(model6)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2) with drift
## Q* = 31.792, df = 10, p-value = 0.0004337
## 
## Model df: 4.   Total lags used: 14
fore4 <- forecast(model6, h = 30)
plot(fore4, col = "black", lwd=2, flty=3, shadecols = c("gray", "mistyrose"))

Forecast

The median lower bound of the 95% prediction interval is approximately -265.61, and the median upper bound is approximately 1097.67. As with ATM2, the negative lower bounds are an artifact of the Gaussian intervals; the practical lower bound for withdrawals is 0.

fore4
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## 2061.429       432.3302 -10.06509 874.7254 -244.2550 1108.915
## 2061.571       395.4158 -48.88845 839.7201 -284.0889 1074.921
## 2061.714       427.9827 -16.52326 872.4887 -251.8305 1107.796
## 2061.857       405.4834 -39.59376 850.5606 -275.2034 1086.170
## 2062.000       424.5244 -20.64950 869.6984 -256.3103 1105.359
## 2062.143       410.8682 -34.48282 856.2193 -270.2374 1091.974
## 2062.286       422.0234 -23.36946 867.4163 -259.1462 1103.193
## 2062.429       413.7481 -31.70136 859.1975 -267.5080 1095.004
## 2062.571       420.2816 -25.18482 865.7480 -261.0005 1101.564
## 2062.714       415.2615 -30.22343 860.7464 -266.0489 1096.572
## 2062.857       419.0755 -26.41604 864.5670 -262.2450 1100.396
## 2063.000       416.0165 -29.48119 861.5142 -265.3134 1097.346
## 2063.143       418.2254 -27.27481 863.7256 -263.1084 1099.559
## 2063.286       416.3443 -29.15795 861.8466 -264.9926 1097.681
## 2063.429       417.6037 -27.89956 863.1069 -263.7347 1098.942
## 2063.571       416.4286 -29.07537 861.9325 -264.9109 1097.768
## 2063.714       417.1252 -28.37912 862.6295 -264.2148 1098.465
## 2063.857       416.3725 -29.13205 861.8770 -264.9679 1097.713
## 2064.000       416.7352 -28.76949 862.2398 -264.6054 1098.076
## 2064.143       416.2349 -29.26981 861.7397 -265.1058 1097.576
## 2064.286       416.3993 -29.10549 861.9041 -264.9415 1097.740
## 2064.429       416.0498 -29.45501 861.5546 -265.2910 1097.391
## 2064.571       416.0963 -29.40856 861.6011 -265.2446 1097.437
## 2064.714       415.8367 -29.66810 861.3416 -265.5041 1097.178
## 2064.857       415.8131 -29.69174 861.3180 -265.5278 1097.154
## 2065.000       415.6072 -29.89762 861.1121 -265.7336 1096.948
## 2065.143       415.5419 -29.96297 861.0467 -265.7990 1096.883
## 2065.286       415.3680 -30.13686 860.8729 -265.9729 1096.709
## 2065.429       415.2778 -30.22703 860.7827 -266.0631 1096.619
## 2065.571       415.1230 -30.38184 860.6279 -266.2179 1096.464

Part B

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

df_power <- read.csv("C:/Users/SeungminSong/Downloads/624R/ResidentialCustomerForecastLoad-624.csv")

head(df_power)
##   CaseSequence YYYY.MMM     KWH
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147
names(df_power)[names(df_power) == "YYYY.MMM"] <- "DATE"
head(df_power)
##   CaseSequence     DATE     KWH
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147
df_power$DATE <- as.Date(paste0(df_power$DATE, "-01"), format="%Y-%b-%d")

df_power_tsibble <- as_tsibble(df_power, index = DATE)

#print(df_power_tsibble, n = Inf)

Plot

plot(df_power_tsibble$DATE, df_power_tsibble$KWH, type = "l", main = "Power Usage",
     xlab = "Date", ylab = "KWH")

Find Missing Values

sum(is.na(df_power_tsibble))
## [1] 1

One way to deal with missing values is to use the average of the surrounding values. The missing value (NA) was located and replaced with the average of the previous and next months' values.

na_index <- which(is.na(df_power_tsibble$KWH))


for (i in na_index) {
  if (i > 1 && i < nrow(df_power_tsibble)) { 
    df_power_tsibble$KWH[i] <- mean(c(df_power_tsibble$KWH[i-1], df_power_tsibble$KWH[i+1]), na.rm = TRUE)
  } else if (i == 1) { 
    df_power_tsibble$KWH[i] <- df_power_tsibble$KWH[i+1] 
  } else if (i == nrow(df_power_tsibble)) { 
    df_power_tsibble$KWH[i] <- df_power_tsibble$KWH[i-1] 
  }
}
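The loop above handles the edge cases explicitly. For a single interior NA, linear interpolation gives the same result; a one-line sketch using the zoo package already relied on for rollapply:

# Linear interpolation of interior gaps; for a single missing month this
# equals the mean of the neighboring months.
df_power_tsibble$KWH <- zoo::na.approx(df_power_tsibble$KWH, na.rm = FALSE)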

Outliers

Quartiles method

The quartile method sets an approximate outlier interval from the distribution and scale of the data. Because it is simple and intuitive, it is widely used for outlier removal. The results below show that the outlier was removed cleanly.

outlier_values_df_power <- boxplot.stats(df_power_tsibble$KWH)$out


boxplot(df_power_tsibble$KWH, main = "KWH", boxwex = 0.5)
mtext(paste("Outliers (KWH): ", paste(outlier_values_df_power, collapse = ", ")), cex = 0.6)

# Calculate the first (Q1) and third (Q3) quartiles
Q1 <- quantile(df_power_tsibble$KWH, 0.25)
Q3 <- quantile(df_power_tsibble$KWH, 0.75)

# Calculate the Interquartile Range (IQR)
IQR <- Q3 - Q1

# Define the bounds for outliers, typically 1.5 times the IQR from the Q1 and Q3
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Find and save the outliers
outliers <- df_power_tsibble$KWH < lower_bound | df_power_tsibble$KWH > upper_bound
outliers_data <- df_power_tsibble[outliers, ]


df_power_tsibble <- df_power_tsibble[!outliers, ]

# Print or return the outliers
outliers_data
## # A tsibble: 1 x 3 [1D]
##   CaseSequence DATE          KWH
##          <int> <date>      <dbl>
## 1          883 2010-07-01 770523
plot(df_power_tsibble$DATE, df_power_tsibble$KWH, type = "l", main = "Power Usage",
     xlab = "Date", ylab = "KWH")

Stationarity

Dickey-Fuller value:

  • -4.7514, which is negative, providing evidence against a unit root. The more negative the statistic, the stronger the evidence to reject the null hypothesis (non-stationarity).

Lag order:

  • The lag order is 5, the number of lags considered in the ADF test.

p-value:

  • 0.01, which is significant even at the 1% level. There is sufficient evidence to reject the null hypothesis (that the series has a unit root and is non-stationary), so the series can be interpreted as stationary.

In conclusion, this ADF result indicates that the series maintains a roughly constant mean and variance over time, providing strong evidence of stationarity.

df_power_ts <- ts(df_power_tsibble$KWH, start=c(1998,1), end=c(2013,12), frequency=12)

adf_test_result <- adf.test(df_power_ts, alternative = "stationary")


print(adf_test_result)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df_power_ts
## Dickey-Fuller = -4.7514, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary


start(df_power_ts)
## [1] 1998    1
end(df_power_ts)
## [1] 2013   12

Seasonal

ggseasonplot(df_power_ts, year.labels = TRUE, year.labels.left = FALSE) + 
  ggtitle("Seasonal Plot")

Transformed

Use the KPSS test (via ndiffs) to determine the appropriate order of differencing for the KWH series. The result is 1.

When comparing non-transformed with differenced data:

  • The differencing seems to reduce the autocorrelation in the ACF plot, indicating an improvement towards stationarity.
  • The PACF after differencing does not show clear spikes, suggesting that differencing has accounted for some of the autoregressive structure in the data.
  • Despite differencing, the ACF indicates potential seasonality that hasn’t been fully accounted for, seen in the somewhat regular pattern of the ACF spikes, which may suggest the need for seasonal differencing or a seasonal ARIMA model.
ndiffs_diff <- ndiffs(df_power_tsibble$KWH, test = "kpss")
ndiffs_diff
## [1] 1
power_diff <- diff(df_power_tsibble$KWH, differences = ndiffs_diff)
lambda_opt <- BoxCox.lambda(df_power_tsibble$KWH)
df_power_transformed <- BoxCox(df_power_tsibble$KWH, lambda_opt)
df_power_diff <- diff(df_power_transformed)
df_power_seasonally_diff <- diff(df_power_transformed, differences = 1, lag = 12)


par(mfrow=c(2,2))

plot(df_power_tsibble$DATE, df_power_tsibble$KWH, type="l",
     main="Non-transformed KWH")

plot(df_power_tsibble$DATE[-1], df_power_diff, type="l",
     main=paste("Differenced KWH with λ =", round(lambda_opt, 2)))


acf(df_power_tsibble$KWH, main="ACF for Non-transformed KWH")
acf(df_power_diff, main=paste("ACF for Differenced KWH\n with λ =", round(lambda_opt, 2)))

pacf(df_power_tsibble$KWH, main="PACF for Non-transformed KWH")
pacf(df_power_diff, main=paste("PACF for Differenced KWH\n with λ =", round(lambda_opt, 2)))

STL Decompose

The graph indeed indicates the presence of both a seasonal component and a trend. Therefore, it seems more appropriate to use df_power_seasonally_diff in ARIMA.

power_ts <- ts(df_power_tsibble$KWH, start = c(1998, 1), end=c(2013,12), frequency=12)

power_decomposed <- stl(power_ts, s.window="periodic")

plot(power_decomposed)
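Whether seasonal differencing is warranted can also be checked directly; a minimal sketch with forecast::nsdiffs:

# Estimated number of seasonal differences required for the monthly series.
nsdiffs(power_ts)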

Augmented Dickey-Fuller (ADF)

Dickey-Fuller value:

  • The test statistic is -4.7639. A unit root implies non-stationarity, and the more negative the statistic, the stronger the evidence against a unit root, i.e., the more likely the series is stationary.

Lag order:

  • The number of lags is set to 5, the number of lags considered when the ADF statistic is computed.

p-value:

  • 0.01, which is very low. There is sufficient evidence to reject the null hypothesis that the series has a unit root (is non-stationary); the null is rejected even at the 1% level, so there is very strong evidence that the differenced series is stationary.

The differenced series can therefore be judged stationary, and time series analysis and predictive modeling based on it are appropriate. Stationary series are a good basis for analysis or forecasting with models such as ARIMA.

df_power_diff_ts <- ts(df_power_seasonally_diff, start = c(1998, 1), end=c(2013,12), frequency = 12)

adf_test_result <- adf.test(df_power_diff_ts, alternative = "stationary")

print(adf_test_result)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df_power_diff_ts
## Dickey-Fuller = -4.7639, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

ACF & PACF

Given that the ACF decays gradually as the lag increases and the PACF shows no dominant spikes, the series may be better suited to a mixed ARIMA model than to a pure AR or MA model.

par(mfrow = c(2, 1))

# Plotting the ACF for the differenced KWH series
acf(df_power_diff_ts, main="ACF for KWH")

# Plotting the PACF for the differenced KWH series
pacf(df_power_diff_ts, main="PACF for KWH")

Residuals

#df_power_diff_ts <- ts(df_power_seasonally_diff, start = c(1998, 1), frequency = 12)

# Fit the ARIMA model
fit <- auto.arima(df_power_diff_ts, seasonal = TRUE)

# Summary and residual checks
summary(fit)
## Series: df_power_diff_ts 
## ARIMA(0,0,1)(1,0,0)[12] with zero mean 
## 
## Coefficients:
##          ma1     sar1
##       0.3301  -0.4010
## s.e.  0.0727   0.0667
## 
## sigma^2 = 5.068e-11:  log likelihood = 2003.19
## AIC=-4000.37   AICc=-4000.24   BIC=-3990.6
## 
## Training set error measures:
##                        ME         RMSE          MAE       MPE     MAPE
## Training set 6.923498e-07 7.081833e-06 5.371889e-06 -3399.037 3683.803
##                   MASE        ACF1
## Training set 0.4993653 -0.02169511
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(1,0,0)[12] with zero mean
## Q* = 40.72, df = 22, p-value = 0.008896
## 
## Model df: 2.   Total lags used: 24
model <- ar(df_power_diff_ts)

# Extract the residuals from the model
residuals <- model$resid

# Check if residuals are present
if (!is.null(residuals)) {
  # Create Q-Q plot of residuals
  qqnorm(residuals)
  qqline(residuals)
} else {
  stop("No residuals to plot. Please check your model fitting process.")
}

shapiro.test(residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.9885, p-value = 0.1522

Model

According to the AIC values, the ARIMA(0,0,1) model has the lower value, indicating a better fit. Its RMSE and MAE are both lower, its MASE is slightly lower, and its ACF1 is closer to 0, meaning less autocorrelation remains in the residuals.

fit_drift <- Arima(df_power_diff_ts, order=c(0,0,1))

summary(fit_drift)
## Series: df_power_diff_ts 
## ARIMA(0,0,1) with non-zero mean 
## 
## Coefficients:
##          ma1   mean
##       0.3296  0e+00
## s.e.  0.0701  1e-04
## 
## sigma^2 = 6.043e-11:  log likelihood = 1987.35
## AIC=-3968.7   AICc=-3968.57   BIC=-3958.92
## 
## Training set error measures:
##                         ME         RMSE          MAE      MPE     MAPE
## Training set -2.483219e-09 7.733025e-06 5.998467e-06 1456.816 1540.194
##                   MASE        ACF1
## Training set 0.5576113 0.002949623
checkresiduals(fit_drift)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 66.52, df = 23, p-value = 4.139e-06
## 
## Model df: 1.   Total lags used: 24
forecast_fit_drift <- forecast(fit_drift, h = 12)
plot(forecast_fit_drift)

fit <- Arima(df_power_diff_ts, order=c(1,0,0))

summary(fit)
## Series: df_power_diff_ts 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1   mean
##       0.3052  0e+00
## s.e.  0.0685  1e-04
## 
## sigma^2 = 6.08e-11:  log likelihood = 1986.77
## AIC=-3967.54   AICc=-3967.41   BIC=-3957.77
## 
## Training set error measures:
##                         ME        RMSE         MAE     MPE     MAPE      MASE
## Training set -4.087088e-09 7.75666e-06 5.99675e-06 1380.77 1453.868 0.5574517
##                    ACF1
## Training set 0.02356285
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0) with non-zero mean
## Q* = 67.025, df = 23, p-value = 3.471e-06
## 
## Model df: 1.   Total lags used: 24
forecast_fit <- forecast(fit, h = 12)
plot(forecast_fit)

Forecast

ARIMA(0,0,1) Forecast

forecast_fit_drift
##          Point Forecast         Lo 80        Hi 80         Lo 95        Hi 95
## Jan 2014   6.666831e-07 -9.295610e-06 1.062898e-05 -1.456933e-05 1.590270e-05
## Feb 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## Mar 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## Apr 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## May 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## Jun 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## Jul 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## Aug 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## Sep 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## Oct 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## Nov 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
## Dec 2014   5.726816e-07 -9.916689e-06 1.106205e-05 -1.546943e-05 1.661479e-05
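Finally, the assignment asks for the forecasts in an Excel-readable file. A minimal sketch with write.csv (the file names are illustrative; note that the name fore was reused for both the ATM1 and ATM3 forecasts above, so the ATM1 result should first be stored under its own name, e.g. fore1 <- forecast(model, h = 30)):

# Sketch: write the point forecasts to Excel-readable CSV files.
atm_out <- data.frame(
  Day  = 1:30,
  ATM1 = as.numeric(fore1$mean),  # assumes fore1 holds the ATM1 forecast
  ATM2 = as.numeric(fore2$mean),
  ATM3 = as.numeric(fore$mean),   # fore currently holds the ATM3 forecast
  ATM4 = as.numeric(fore4$mean)
)
write.csv(atm_out, "ATM624_forecast_May2010.csv", row.names = FALSE)

# NOTE: forecast_fit_drift is on the Box-Cox, seasonally differenced scale;
# the values would need back-transformation to KWH before reporting.
kwh_out <- data.frame(Month = month.abb,
                      KWH_2014 = as.numeric(forecast_fit_drift$mean))
write.csv(kwh_out, "ResidentialPower_forecast_2014.csv", row.names = FALSE)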