Data Preprocessing
Load data
The data contain only 7 variables, but there are 10572 records.
## 'data.frame': 10572 obs. of 7 variables:
## $ SeriesInd: num 40669 40669 40669 40669 40669 ...
## $ category : chr "S03" "S02" "S01" "S06" ...
## $ Var01 : num 30.6 10.3 26.6 27.5 69.3 ...
## $ Var02 : num 1.23e+08 6.09e+07 1.04e+07 3.93e+07 2.78e+07 ...
## $ Var03 : num 30.3 10.1 25.9 26.8 68.2 ...
## $ Var05 : num 30.5 10.2 26.2 27 68.7 ...
## $ Var07 : num 30.6 10.3 26 27.3 69.2 ...
Split data
As requested, the last 140 periods of each category are reserved as true values so that forecast accuracy can be computed. With six categories, the remaining data have 10572 - 6 * 140 = 9732 records.
According to the summary report below, every variable except SeriesInd and category has missing values, so imputation is needed.
## SeriesInd category Var01 Var02
## Min. :40669 Length:9732 Min. : 9.03 Min. : 1339900
## 1st Qu.:41253 Class :character 1st Qu.: 23.10 1st Qu.: 12520675
## Median :41846 Mode :character Median : 38.44 Median : 21086550
## Mean :41843 Mean : 46.98 Mean : 37035741
## 3rd Qu.:42430 3rd Qu.: 66.78 3rd Qu.: 42486700
## Max. :43021 Max. :195.18 Max. :480879500
## NA's :14 NA's :2
## Var03 Var05 Var07
## Min. : 8.82 Min. : 8.99 Min. : 8.92
## 1st Qu.: 22.59 1st Qu.: 22.91 1st Qu.: 22.88
## Median : 37.66 Median : 38.05 Median : 38.05
## Mean : 46.12 Mean : 46.55 Mean : 46.56
## 3rd Qu.: 65.88 3rd Qu.: 66.38 3rd Qu.: 66.31
## Max. :189.36 Max. :195.00 Max. :189.72
## NA's :26 NA's :26 NA's :26
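The hold-out split and the missing-value count can be sketched in base R as follows. This is only an illustration on a small toy data frame: the object names `toy`, `train`, `holdout` and the horizon `h` are assumptions, not the names used in the original analysis.

```r
# Toy stand-in for the full data set: 3 categories x 10 periods (hypothetical sizes).
set.seed(1)
toy <- data.frame(
  SeriesInd = rep(1:10, times = 3),
  category  = rep(c("S01", "S02", "S03"), each = 10),
  Var01     = round(runif(30, 10, 50), 2)
)
toy$Var01[c(4, 17)] <- NA   # inject two missing values

h <- 2  # hold-out horizon per category (140 in the text)

# Rank each row from the end of its category: 1 = latest period.
rank_from_end <- ave(toy$SeriesInd, toy$category,
                     FUN = function(x) rank(-x, ties.method = "first"))

train   <- toy[rank_from_end >  h, ]
holdout <- toy[rank_from_end <= h, ]

nrow(train)             # 30 - 3 * 2 = 24, the analogue of 10572 - 6 * 140 = 9732
colSums(is.na(train))   # number of missing values in each column
```

`colSums(is.na(...))` gives the same per-variable NA counts that appear in the last row of a `summary()` report.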
Visualization before imputation
Distributions
Plotting the distribution of each variable across all categories
combined shows that Var02 is the most problematic and needs
further investigation. Var01, Var03, Var05, and Var07 have
extreme outliers, which need to be removed first.
Correlations
Var02 shows no relationship with the other variables, but the remaining variables are highly correlated with one another. A linear approach can therefore help impute the missing values.
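A correlation check of this kind might look like the sketch below. The data are simulated purely for illustration (two noisy copies of one random walk plus an unrelated large-scale variable), so the column values are assumptions and not the project data; `use = "pairwise.complete.obs"` lets `cor()` cope with the missing values.

```r
set.seed(42)
price <- 50 + cumsum(rnorm(200))          # a shared underlying level

toy <- data.frame(
  Var01 = price + rnorm(200, sd = 0.3),
  Var03 = price - 0.5 + rnorm(200, sd = 0.3),
  Var02 = rlnorm(200, meanlog = 16)       # unrelated, on a much larger scale
)
toy$Var01[5] <- NA                        # a missing value, as in the real data

# Pairwise correlations, ignoring rows where either variable is missing.
cor_mat <- cor(toy, use = "pairwise.complete.obs")
round(cor_mat, 2)
```

In such a matrix the Var01/Var03 entry is close to 1, while the Var02 row stays near 0.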
Imputation
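library(dplyr)     # %>% and select()
library(imputeTS)  # na_interpolation()
# impute missing values by linear interpolation (the na_interpolation() default)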
var01.imp <- req.data %>% select(Var01) %>% na_interpolation()
var02.imp <- req.data %>% select(Var02) %>% na_interpolation()
var03.imp <- req.data %>% select(Var03) %>% na_interpolation()
var05.imp <- req.data %>% select(Var05) %>% na_interpolation()
var07.imp <- req.data %>% select(Var07) %>% na_interpolation()
# copy the data, then overwrite each variable with its imputed version
req.data.cp <- req.data
req.data.cp["Var01"] <- var01.imp
req.data.cp["Var02"] <- var02.imp
req.data.cp["Var03"] <- var03.imp
req.data.cp["Var05"] <- var05.imp
req.data.cp["Var07"] <- var07.imp
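To make the effect of linear interpolation concrete, here is a minimal base R sketch built on `approx()`. The helper `interp_linear()` and the toy data are hypothetical and only stand in for what `na_interpolation()` does with its default linear option. The second part illustrates one point worth checking: interpolation follows row order, so when rows of different categories are interleaved, interpolating within each category may be preferable.

```r
# Fill NAs by linear interpolation between the nearest known neighbours;
# rule = 2 carries the first/last known value outwards at the edges.
interp_linear <- function(x) {
  idx   <- seq_along(x)
  known <- !is.na(x)
  approx(idx[known], x[known], xout = idx, rule = 2)$y
}

interp_linear(c(10, NA, 14, NA, NA, 20, NA))
# 10 12 14 16 18 20 20

# Interleaved categories, as in the raw data (toy values).
toy <- data.frame(
  category = rep(c("S01", "S02"), times = 4),
  Var01    = c(10, 100, NA, 110, 14, NA, 16, 130)
)

toy$Var01_global <- interp_linear(toy$Var01)                         # mixes categories
toy$Var01_by_cat <- ave(toy$Var01, toy$category, FUN = interp_linear)
toy
```

In the toy example the global fill for row 3 is 105 (the midpoint of two S02 values), whereas the per-category fill is 12, which lies between the neighbouring S01 values 10 and 14.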
Split data by category with imputation
Looking closely at the data, although there are 10572 records in total, each category contains 1762 records. After setting aside the last 140 periods, 1622 records remain for each variable in each category.
## 'data.frame': 1622 obs. of 13 variables:
## $ SeriesInd: num 40669 40670 40671 40672 40673 ...
## $ Var01_S01: num 26.6 26.3 26 25.8 26.3 ...
## $ Var02_S01: num 10369300 10943800 8933800 10775400 12875600 ...
## $ Var02_S02: num 6.09e+07 2.16e+08 2.00e+08 1.30e+08 1.30e+08 ...
## $ Var03_S02: num 10.1 10.4 11.1 11.3 11.5 ...
## $ Var05_S03: num 30.5 30.7 30.6 30.2 30 ...
## $ Var07_S03: num 30.6 30.6 30.1 30.1 30.3 ...
## $ Var01_S04: num 17.2 17.2 17.3 16.9 16.8 ...
## $ Var02_S04: num 16587400 11718100 16422000 31816300 15470000 ...
## $ Var02_S05: num 27809100 30174700 35044700 27192100 24891800 ...
## $ Var03_S05: num 68.2 68.8 69.3 69.4 69.2 ...
## $ Var05_S06: num 27 27.3 28 28.1 28.9 ...
## $ Var07_S06: num 27.3 28.1 28.1 29.1 28.9 ...
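One way to arrive at a wide table like the one above is to split the long data by category and pick the required variables from each piece. The sketch below uses a tiny toy data frame and assumes that every category covers the same SeriesInd values in the same order; the object names are illustrative only.

```r
toy <- data.frame(
  SeriesInd = rep(1:4, each = 2),
  category  = rep(c("S01", "S02"), times = 4),
  Var01     = c(26.6, 10.3, 26.3, 10.4, 26.0, 11.1, 25.8, 11.3),
  Var02     = c(10, 60, 11, 216, 9, 200, 11, 130)
)

# One data frame per category, rows kept in their original order.
by_cat <- split(toy, toy$category)

# Keep only the variables needed for each category, one column per series.
wide <- data.frame(
  SeriesInd = by_cat$S01$SeriesInd,
  Var01_S01 = by_cat$S01$Var01,
  Var02_S01 = by_cat$S01$Var02,
  Var02_S02 = by_cat$S02$Var02
)
str(wide)
```

With six categories and two variables per category, the same pattern yields the 12 series plus SeriesInd shown in the output above.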
Visualization after splitting and imputing the data
The plot below confirms that Var02 remains problematic within
each category, and that most variables need some transformation
to bring their distributions closer to normal.
After transformation, the distributions look much closer to normal.
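The text does not state which transformation was applied. Purely as an illustration, and assuming a log transform, the sketch below shows how the skewness of a right-skewed variable can be compared before and after transforming. The `skewness()` helper and the simulated data are hypothetical.

```r
set.seed(2024)
x <- rlnorm(1000, meanlog = 3, sdlog = 1)   # right-skewed toy data

# Moment-based sample skewness; 0 for a perfectly symmetric sample.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log(x))
c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the transform is one simple indication that the distribution has moved closer to normal; a histogram or Q-Q plot would be the visual counterpart.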
# Time series formation
# Modeling
# var01_s01
var01_s01_ets <- ets(var01_s01)
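# ARIMA must be fit on var01_s01; passing var02_s01 here yields the identical
# var01_s01 and var02_s01 ARIMA results reported under Model comparison
var01_s01_arima <- auto.arima(var01_s01, stepwise = FALSE, approximation = FALSE)
# var02_s01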
var02_s01_ets <- ets(var02_s01)
var02_s01_arima <- auto.arima(var02_s01, stepwise = FALSE, approximation = FALSE)
# var02_s02
var02_s02_ets <- ets(var02_s02)
var02_s02_arima <- auto.arima(var02_s02, stepwise = FALSE, approximation = FALSE)
# var03_s02
var03_s02_ets <- ets(var03_s02)
var03_s02_arima <- auto.arima(var03_s02, stepwise = FALSE, approximation = FALSE)
# var05_s03
var05_s03_ets <- ets(var05_s03)
var05_s03_arima <- auto.arima(var05_s03, stepwise = FALSE, approximation = FALSE)
# var07_s03
var07_s03_ets <- ets(var07_s03)
var07_s03_arima <- auto.arima(var07_s03, stepwise = FALSE, approximation = FALSE)
# var01_s04
var01_s04_ets <- ets(var01_s04)
var01_s04_arima <- auto.arima(var01_s04, stepwise = FALSE, approximation = FALSE)
# var02_s04
var02_s04_ets <- ets(var02_s04)
var02_s04_arima <- auto.arima(var02_s04, stepwise = FALSE, approximation = FALSE)
# var02_s05
var02_s05_ets <- ets(var02_s05)
var02_s05_arima <- auto.arima(var02_s05, stepwise = FALSE, approximation = FALSE)
# var03_s05
var03_s05_ets <- ets(var03_s05)
var03_s05_arima <- auto.arima(var03_s05, stepwise = FALSE, approximation = FALSE)
# var05_s06
var05_s06_ets <- ets(var05_s06)
var05_s06_arima <- auto.arima(var05_s06, stepwise = FALSE, approximation = FALSE)
# var07_s06
var07_s06_ets <- ets(var07_s06)
var07_s06_arima <- auto.arima(var07_s06, stepwise = FALSE, approximation = FALSE)
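The twelve near-identical blocks above could also be written as a loop over a named list of series. The sketch below assumes the forecast package is installed and uses two simulated series in place of the real ones; the list name `series` and the result structure `fits` are illustrative choices.

```r
library(forecast)

set.seed(123)
# Simulated stand-ins for the real series (hypothetical values).
series <- list(
  var01_s01 = ts(50 + cumsum(rnorm(120))),
  var02_s01 = ts(exp(rnorm(120, mean = 16, sd = 0.3)))
)

# Fit both model types to every series with the same settings.
fits <- lapply(series, function(y) {
  list(
    ets   = ets(y),
    arima = auto.arima(y, stepwise = FALSE, approximation = FALSE)
  )
})

names(fits)             # one entry per series
fits$var01_s01$arima    # the ARIMA fit for var01_s01
```

Because each series is referenced by name only once, a copy-paste slip such as fitting one series' model on another series' data becomes harder to make.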
Model comparison
##
## Ljung-Box test
##
## data: Residuals from ETS(M,Ad,N)
## Q* = 11.612, df = 5, p-value = 0.0405
##
## Model df: 5. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 87.564, df = 8, p-value = 1.443e-15
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 114.22, df = 8, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 197, df = 8, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(M,A,N)
## Q* = 135.06, df = 6, p-value < 2.2e-16
##
## Model df: 4. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 161.41, df = 8, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 169.53, df = 8, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 90.992, df = 8, p-value = 3.331e-16
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 58.9, df = 8, p-value = 7.656e-10
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 4.3284, df = 8, p-value = 0.8263
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 9.7041, df = 8, p-value = 0.2864
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 12.686, df = 8, p-value = 0.1231
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)
## Q* = 12.435, df = 6, p-value = 0.05295
##
## Model df: 4. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)
## Q* = 12.435, df = 6, p-value = 0.05295
##
## Model df: 4. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)
## Q* = 7.7748, df = 6, p-value = 0.2551
##
## Model df: 4. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)
## Q* = 2.0363, df = 8, p-value = 0.9799
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2) with drift
## Q* = 6.2525, df = 7, p-value = 0.5106
##
## Model df: 3. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2) with drift
## Q* = 4.5964, df = 7, p-value = 0.7091
##
## Model df: 3. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)
## Q* = 7.1369, df = 8, p-value = 0.5219
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,3)
## Q* = 3.851, df = 6, p-value = 0.6968
##
## Model df: 4. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)
## Q* = 9.4701, df = 7, p-value = 0.2206
##
## Model df: 3. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 4.3193, df = 9, p-value = 0.8892
##
## Model df: 1. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 2.123, df = 8, p-value = 0.977
##
## Model df: 2. Total lags used: 10
##
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)
## Q* = 2.0941, df = 7, p-value = 0.9544
##
## Model df: 3. Total lags used: 10
## model_name ETS_mape ARIMA_mape ETS_p ARIMA_p
## 1 var01_s01 0.3002 1.5043 0.0405 0.0529
## 2 var02_s01 1.5674 1.5043 0.0000 0.0529
## 3 var02_s02 0.0093 0.0090 0.0000 0.2551
## 4 var03_s02 0.3503 0.4252 0.0000 0.9799
## 5 var05_s03 1.1341 1.1323 0.0000 0.5106
## 6 var07_s03 1.1172 1.0936 0.0000 0.7091
## 7 var01_s04 0.2139 0.2374 0.0000 0.5219
## 8 var02_s04 0.0150 0.0142 0.0000 0.6968
## 9 var02_s05 0.0101 0.0099 0.0000 0.2206
## 10 var03_s05 2.8178 2.8169 0.8263 0.8892
## 11 var05_s06 0.4341 0.4332 0.2864 0.9770
## 12 var07_s06 0.4396 0.4372 0.1231 0.9544
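The two kinds of columns in the table can be illustrated with base R alone. In the Ljung-Box output, the degrees of freedom equal the number of lags minus the number of estimated parameters (for example 10 - 1 = 9 for ARIMA(0,1,1)), and a p-value above 0.05 gives no evidence of remaining autocorrelation in the residuals. In the table, every ARIMA fit is above 0.05, while most ETS fits have p-values near zero. The sketch below uses a simulated series and `stats::arima()` rather than the original models, and computes MAPE in percent on held-out periods; whether the table uses the same MAPE convention is not stated in the text.

```r
set.seed(7)
y     <- ts(50 + cumsum(rnorm(200)))   # simulated series (hypothetical)
train <- window(y, end = 180)
test  <- window(y, start = 181)        # last 20 periods held out

fit <- arima(train, order = c(0, 1, 1))   # one estimated coefficient

# Ljung-Box test on the residuals: 10 lags, 1 fitted parameter.
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
lb$parameter   # df = 10 - 1 = 9
lb$p.value

# Mean absolute percentage error of the forecasts on the held-out periods.
fc   <- predict(fit, n.ahead = length(test))$pred
mape <- mean(abs((test - fc) / test)) * 100
mape
```

Selecting, for each series, the model with the lower MAPE among those whose residuals pass the Ljung-Box test is one reasonable way to read the comparison table.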