624: ETS vs ARIMA

Jie Zou

2022-06-26

Data Preprocessing

Load data

The data contains only 7 variables, but there are 10572 records.

## 'data.frame':    10572 obs. of  7 variables:
##  $ SeriesInd: num  40669 40669 40669 40669 40669 ...
##  $ category : chr  "S03" "S02" "S01" "S06" ...
##  $ Var01    : num  30.6 10.3 26.6 27.5 69.3 ...
##  $ Var02    : num  1.23e+08 6.09e+07 1.04e+07 3.93e+07 2.78e+07 ...
##  $ Var03    : num  30.3 10.1 25.9 26.8 68.2 ...
##  $ Var05    : num  30.5 10.2 26.2 27 68.7 ...
##  $ Var07    : num  30.6 10.3 26 27.3 69.2 ...

Split data

As requested, the last 140 periods of each category are reserved as true values so that forecast accuracy can be computed. With 6 categories, the data available for modeling has 10572 - 6 * 140 = 9732 records.
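The split itself is not shown in the report. As a minimal sketch of one way to hold out the last periods of every category, the toy data frame `toy` and the shortened holdout length `h` below are hypothetical stand-ins for the real data and the 140-period holdout:

```r
# Toy stand-in for the full data: 3 categories x 10 periods (hypothetical values)
toy <- data.frame(
  SeriesInd = rep(1:10, each = 3),
  category  = rep(c("S01", "S02", "S03"), times = 10),
  Var01     = seq_len(30)
)

h <- 2  # holdout length per category (140 in the report)

# Position of each row within its category, ordered by SeriesInd,
# and the number of rows in that category
pos <- ave(toy$SeriesInd, toy$category, FUN = rank)
n   <- ave(toy$SeriesInd, toy$category, FUN = length)

train   <- toy[pos <= n - h, ]  # everything except the last h periods
holdout <- toy[pos >  n - h, ]  # the last h periods of each category

nrow(train)
nrow(holdout)

# The same arithmetic for the real data: 6 categories, 140 periods held out each
10572 - 6 * 140
```

Ranking within category rather than cutting the last rows of the whole table matters here because the rows of the different categories are interleaved.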

According to the summary below, every variable except SeriesInd and category has missing values, so imputation is required.

##    SeriesInd       category             Var01            Var02          
##  Min.   :40669   Length:9732        Min.   :  9.03   Min.   :  1339900  
##  1st Qu.:41253   Class :character   1st Qu.: 23.10   1st Qu.: 12520675  
##  Median :41846   Mode  :character   Median : 38.44   Median : 21086550  
##  Mean   :41843                      Mean   : 46.98   Mean   : 37035741  
##  3rd Qu.:42430                      3rd Qu.: 66.78   3rd Qu.: 42486700  
##  Max.   :43021                      Max.   :195.18   Max.   :480879500  
##                                     NA's   :14       NA's   :2          
##      Var03            Var05            Var07       
##  Min.   :  8.82   Min.   :  8.99   Min.   :  8.92  
##  1st Qu.: 22.59   1st Qu.: 22.91   1st Qu.: 22.88  
##  Median : 37.66   Median : 38.05   Median : 38.05  
##  Mean   : 46.12   Mean   : 46.55   Mean   : 46.56  
##  3rd Qu.: 65.88   3rd Qu.: 66.38   3rd Qu.: 66.31  
##  Max.   :189.36   Max.   :195.00   Max.   :189.72  
##  NA's   :26       NA's   :26       NA's   :26

Visualization before imputation

Distributions

From the plots of each variable's distribution, pooled across categories, Var02 is the most problematic and needs further investigation. Var01, Var03, Var05, and Var07 also have extreme outliers, which need to be removed first.
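The report does not say how the outliers were identified. One common option, shown here purely as an assumption, is the boxplot rule with a wide fence. The vector `x` and the choice `coef = 3` are hypothetical:

```r
set.seed(1)
# Hypothetical values with one extreme point appended at the end
x <- c(rnorm(100, mean = 40, sd = 5), 400)

# Points lying more than 3 * IQR beyond the hinges are treated as extreme
out <- boxplot.stats(x, coef = 3)$out

# Mark extreme points as missing so that they can be imputed later
x_clean <- x
x_clean[x %in% out] <- NA

sum(is.na(x_clean))
```

Setting the flagged values to `NA` rather than dropping the rows keeps the series evenly spaced, which matters for the time series models fitted later.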

Correlations

Var02 shows no relationship with the other variables, but the remaining variables are highly correlated with one another. A linear approach is therefore a reasonable way to impute the missing values.
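A correlation matrix of this kind can be computed even when some values are missing. The sketch below uses a hypothetical data frame `toy` in which two price-like columns share a common path and a volume-like column is unrelated. It is not the report's data:

```r
set.seed(42)
# Shared random-walk path for the two price-like columns (hypothetical)
path <- 50 + cumsum(rnorm(200))

toy <- data.frame(
  Var01 = path + rnorm(200, sd = 0.1),
  Var02 = runif(200, 1e6, 1e8),        # unrelated volume-like column
  Var03 = path + rnorm(200, sd = 0.1)
)
toy$Var01[c(5, 50)] <- NA              # a few missing values

# Use every pair of rows where both variables are observed
cm <- cor(toy, use = "pairwise.complete.obs")
round(cm, 2)
```

With `use = "pairwise.complete.obs"`, rows with a missing value are dropped only for the pairs they affect, so the check can be run before imputation.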

Imputation

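library(dplyr)     # %>% and select()
library(imputeTS)  # na_interpolation()

# impute missing values by linear interpolation (the default option of na_interpolation)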
var01.imp <- req.data %>% select(Var01) %>% na_interpolation()
var02.imp <- req.data %>% select(Var02) %>% na_interpolation()
var03.imp <- req.data %>% select(Var03) %>% na_interpolation()
var05.imp <- req.data %>% select(Var05) %>% na_interpolation()
var07.imp <- req.data %>% select(Var07) %>% na_interpolation()

# copy the data, then overwrite each column with its imputed version
req.data.cp <- req.data
req.data.cp["Var01"] <- var01.imp
req.data.cp["Var02"] <- var02.imp
req.data.cp["Var03"] <- var03.imp
req.data.cp["Var05"] <- var05.imp
req.data.cp["Var07"] <- var07.imp
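To make the effect of linear interpolation concrete, the sketch below reproduces it on toy vectors with base R's `approx()`. The helper `interp` and the toy values are hypothetical. The second part shows an alternative that interpolates within each category, which is not what the code above does, since that code interpolates down the whole column:

```r
# Linear interpolation of interior gaps on a toy vector
x   <- c(10, NA, 14, NA, NA, 20)
idx <- which(!is.na(x))
filled <- approx(idx, x[idx], xout = seq_along(x), rule = 2)$y
filled

# Hypothetical helper: interpolate one vector
interp <- function(v) {
  i <- which(!is.na(v))
  approx(i, v[i], xout = seq_along(v), rule = 2)$y
}

# Rows of two categories interleaved, as in the raw data
toy <- data.frame(
  category = rep(c("S01", "S02"), times = 3),
  Var01    = c(1, 100, NA, NA, 3, 300)
)

toy$whole_column <- interp(toy$Var01)                          # mixes categories
toy$by_category  <- ave(toy$Var01, toy$category, FUN = interp) # stays within category
toy
```

When neighbouring rows belong to different categories, whole-column interpolation borrows values from other series, so the per-category variant may be worth comparing.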

Split data by category with imputation

Looking closely at the data, although there are 10572 records in total, each category contains 1762 records. After holding out the last 140 periods, 1622 records remain per category for each variable.

## 'data.frame':    1622 obs. of  13 variables:
##  $ SeriesInd: num  40669 40670 40671 40672 40673 ...
##  $ Var01_S01: num  26.6 26.3 26 25.8 26.3 ...
##  $ Var02_S01: num  10369300 10943800 8933800 10775400 12875600 ...
##  $ Var02_S02: num  6.09e+07 2.16e+08 2.00e+08 1.30e+08 1.30e+08 ...
##  $ Var03_S02: num  10.1 10.4 11.1 11.3 11.5 ...
##  $ Var05_S03: num  30.5 30.7 30.6 30.2 30 ...
##  $ Var07_S03: num  30.6 30.6 30.1 30.1 30.3 ...
##  $ Var01_S04: num  17.2 17.2 17.3 16.9 16.8 ...
##  $ Var02_S04: num  16587400 11718100 16422000 31816300 15470000 ...
##  $ Var02_S05: num  27809100 30174700 35044700 27192100 24891800 ...
##  $ Var03_S05: num  68.2 68.8 69.3 69.4 69.2 ...
##  $ Var05_S06: num  27 27.3 28 28.1 28.9 ...
##  $ Var07_S06: num  27.3 28.1 28.1 29.1 28.9 ...
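The reshaping code is not shown. A minimal sketch of how a long table can be turned into one column per variable and category, assuming base R's `reshape()` and a hypothetical toy data frame `toy`, is:

```r
# Toy long-format data: 2 categories x 3 periods (hypothetical values)
toy <- data.frame(
  SeriesInd = rep(1:3, each = 2),
  category  = rep(c("S01", "S02"), times = 3),
  Var01     = c(26.6, 10.3, 26.3, 10.4, 26.0, 11.1),
  Var02     = c(10, 60, 11, 61, 12, 62)
)

# One row per SeriesInd, one column per variable/category pair
wide <- reshape(toy, idvar = "SeriesInd", timevar = "category",
                direction = "wide", sep = "_")

# Keep only the variable/category combinations that are needed
keep <- c("SeriesInd", "Var01_S01", "Var02_S01", "Var02_S02")
wide <- wide[, keep]
str(wide)
```

The `sep = "_"` argument produces names in the same `Var01_S01` style as the output above.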

Visualization after split and impute data

The plot below confirms that Var02 remains problematic across categories, and most variables need some transformation to bring their distributions closer to normal.

The distribution looks much closer to normal.

Time series formation
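The code that builds the time series objects is not shown. The sketch below assumes each column of the wide data frame is converted with `ts()` at frequency 1, meaning no seasonal period. The data frame `wide` here is a short hypothetical stand-in:

```r
# Hypothetical stand-in for the wide, imputed data frame
wide <- data.frame(
  SeriesInd = 40669:40678,
  Var01_S01 = c(26.6, 26.3, 26.0, 25.8, 26.3, 26.1, 26.4, 26.7, 26.5, 26.9),
  Var02_S01 = c(10.4, 10.9, 8.9, 10.8, 12.9, 11.2, 9.8, 10.1, 11.5, 10.7) * 1e6
)

# A single series; frequency defaults to 1
var01_s01 <- ts(wide$Var01_S01)

# All series at once, as a named list of ts objects
series <- lapply(wide[-1], ts)
names(series)
```

Frequency 1 is only an assumption. It is consistent with the non-seasonal models selected later, but the original call may differ.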

Modeling

# var01_s01
var01_s01_ets <- ets(var01_s01)
var01_s01_arima <- auto.arima(var01_s01, stepwise = FALSE, approximation = FALSE)

# var02_s01
var02_s01_ets <- ets(var02_s01)
var02_s01_arima <- auto.arima(var02_s01, stepwise = FALSE, approximation = FALSE)

# var02_s02
var02_s02_ets <- ets(var02_s02)
var02_s02_arima <- auto.arima(var02_s02, stepwise = FALSE, approximation = FALSE)

# var03_s02
var03_s02_ets <- ets(var03_s02)
var03_s02_arima <- auto.arima(var03_s02, stepwise = FALSE, approximation = FALSE)

# var05_s03
var05_s03_ets <- ets(var05_s03)
var05_s03_arima <- auto.arima(var05_s03, stepwise = FALSE, approximation = FALSE)

# var07_s03
var07_s03_ets <- ets(var07_s03)
var07_s03_arima <- auto.arima(var07_s03, stepwise = FALSE, approximation = FALSE)

# var01_s04
var01_s04_ets <- ets(var01_s04)
var01_s04_arima <- auto.arima(var01_s04, stepwise = FALSE, approximation = FALSE)

# var02_s04
var02_s04_ets <- ets(var02_s04)
var02_s04_arima <- auto.arima(var02_s04, stepwise = FALSE, approximation = FALSE)

# var02_s05
var02_s05_ets <- ets(var02_s05)
var02_s05_arima <- auto.arima(var02_s05, stepwise = FALSE, approximation = FALSE)

# var03_s05
var03_s05_ets <- ets(var03_s05)
var03_s05_arima <- auto.arima(var03_s05, stepwise = FALSE, approximation = FALSE)

# var05_s06
var05_s06_ets <- ets(var05_s06)
var05_s06_arima <- auto.arima(var05_s06, stepwise = FALSE, approximation = FALSE)

# var07_s06
var07_s06_ets <- ets(var07_s06)
var07_s06_arima <- auto.arima(var07_s06, stepwise = FALSE, approximation = FALSE)
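Since the same two calls are repeated for every series, they could also be written as a loop over a named list. This is only a sketch: `series`, `fit_pair`, and `fits` are hypothetical names, the two simulated series stand in for the twelve real ones, and `ets()` and `auto.arima()` are assumed to come from the forecast package:

```r
library(forecast)

set.seed(624)
# Hypothetical stand-ins for the real series
series <- list(
  var01_s01 = ts(50 + cumsum(rnorm(120))),
  var02_s01 = ts(1e7 + cumsum(rnorm(120, sd = 1e5)))
)

# Fit both model families to the same series
fit_pair <- function(y) {
  list(
    ets   = ets(y),
    arima = auto.arima(y, stepwise = FALSE, approximation = FALSE)
  )
}

fits <- lapply(series, fit_pair)

# Access a fitted model by series name and model family
fits$var01_s01$arima
```

Passing each series to both models through one function keeps every ETS/ARIMA pair tied to the same data.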

Model comparison

For each series, the Ljung-Box test is applied to the model residuals: the first twelve outputs below are for the ETS models and the last twelve for the ARIMA models, in the same order as the fits above. The table at the end summarizes the MAPE and the Ljung-Box p-value of both models for every series; p-values shown as 0.0000 are rounded to four decimals. Note that the ARIMA output and table entries for var01_s01 are identical to those for var02_s01 (ARIMA(2,1,2), Q* = 12.435, MAPE 1.5043), so the var01_s01 ARIMA figures reflect the var02_s01 series and should not be compared with the var01_s01 ETS figures.

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,Ad,N)
## Q* = 11.612, df = 5, p-value = 0.0405
## 
## Model df: 5.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 87.564, df = 8, p-value = 1.443e-15
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 114.22, df = 8, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 197, df = 8, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,A,N)
## Q* = 135.06, df = 6, p-value < 2.2e-16
## 
## Model df: 4.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 161.41, df = 8, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 169.53, df = 8, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 90.992, df = 8, p-value = 3.331e-16
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 58.9, df = 8, p-value = 7.656e-10
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 4.3284, df = 8, p-value = 0.8263
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 9.7041, df = 8, p-value = 0.2864
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 12.686, df = 8, p-value = 0.1231
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 12.435, df = 6, p-value = 0.05295
## 
## Model df: 4.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 12.435, df = 6, p-value = 0.05295
## 
## Model df: 4.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 7.7748, df = 6, p-value = 0.2551
## 
## Model df: 4.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)
## Q* = 2.0363, df = 8, p-value = 0.9799
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 6.2525, df = 7, p-value = 0.5106
## 
## Model df: 3.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 4.5964, df = 7, p-value = 0.7091
## 
## Model df: 3.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)
## Q* = 7.1369, df = 8, p-value = 0.5219
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,3)
## Q* = 3.851, df = 6, p-value = 0.6968
## 
## Model df: 4.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)
## Q* = 9.4701, df = 7, p-value = 0.2206
## 
## Model df: 3.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 4.3193, df = 9, p-value = 0.8892
## 
## Model df: 1.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 2.123, df = 8, p-value = 0.977
## 
## Model df: 2.   Total lags used: 10
## 
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1)
## Q* = 2.0941, df = 7, p-value = 0.9544
## 
## Model df: 3.   Total lags used: 10

##    model_name ETS_mape ARIMA_mape  ETS_p ARIMA_p
## 1   var01_s01   0.3002     1.5043 0.0405  0.0529
## 2   var02_s01   1.5674     1.5043 0.0000  0.0529
## 3   var02_s02   0.0093     0.0090 0.0000  0.2551
## 4   var03_s02   0.3503     0.4252 0.0000  0.9799
## 5   var05_s03   1.1341     1.1323 0.0000  0.5106
## 6   var07_s03   1.1172     1.0936 0.0000  0.7091
## 7   var01_s04   0.2139     0.2374 0.0000  0.5219
## 8   var02_s04   0.0150     0.0142 0.0000  0.6968
## 9   var02_s05   0.0101     0.0099 0.0000  0.2206
## 10  var03_s05   2.8178     2.8169 0.8263  0.8892
## 11  var05_s06   0.4341     0.4332 0.2864  0.9770
## 12  var07_s06   0.4396     0.4372 0.1231  0.9544
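
The two quantities in the table can be reproduced by hand. The sketch below uses base R only, with a simulated series `y` and a fixed ARIMA(0,1,1) as hypothetical stand-ins. Whether the reported MAPE is in-sample or computed on the 140 held-out periods is not stated, so the in-sample version here is an assumption:

```r
set.seed(1)
y <- ts(100 + cumsum(rnorm(300)))    # hypothetical series

fit <- arima(y, order = c(0, 1, 1))  # stats::arima with one estimated coefficient
res <- residuals(fit)

# Ljung-Box test on 10 lags; fitdf removes one degree of freedom per coefficient
lb <- Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 1)
lb$parameter  # degrees of freedom: 10 lags minus 1 coefficient
lb$p.value

# In-sample MAPE in percent: mean of |actual - fitted| / |actual|
fitted_vals <- y - res
mape <- mean(abs((y - fitted_vals) / y)) * 100
mape
```

The degrees of freedom follow the same rule as the output above, where the ARIMA(0,1,1) fit reports "Model df: 1" and "Total lags used: 10", giving df = 9. A large p-value means the residuals show no significant autocorrelation, which is the pattern seen for most ARIMA fits in the table.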