Econometrics

~ Meeting 12 ~


Contact:
Email
Instagram https://www.instagram.com/diasary_nm/
RPubs https://rpubs.com/diyasarya/

library(plotly)     # interactive plots (plot_ly)
library(tidyverse)  # data wrangling (tibble, dplyr)
library(lubridate)  # date handling (ymd)
library(aTSA)       # Augmented Dickey-Fuller test (adf.test)
library(stats)      # base statistics (ts, acf, pacf)
library(lmtest)     # bptest, raintest, coeftest
library(forecast)   # auto.arima, Arima, forecast, accuracy
library(car)        # vif
library(nortest)    # lillie.test

Simulated Data

set.seed(123)

n <- 100

start_date <- ymd("2022-01-01")
end_date <- ymd("2022-04-10")
dates <- seq(start_date, end_date, by = "day")

promotion_spending <- runif(n, min = 1000, max = 5000)
price <- rnorm(n, mean = 50, sd = 10)
weather_conditions <- sample(c("sunny", "cloudy", "rainy"), size = n, replace = TRUE)

sales_trend <- 0.1*seq(1, n)                        # mild upward trend
seasonal_pattern <- sin(seq(1, n)*2*pi/365*7)*100   # sinusoidal seasonal component (7 cycles per year)
sales_noise <- rnorm(n, mean = 0, sd = 100)         # random noise
sales_revenue <- 1000 + sales_trend + seasonal_pattern + sales_noise

simulated_data <- tibble(Date = dates,
                         Promotion_Spending = promotion_spending,
                         Price = price,
                         Weather_Conditions = weather_conditions,
                         Sales_Revenue = sales_revenue)

head(simulated_data)

Regression

Regression is a statistical method used to estimate the relationship between a dependent variable and one or more independent variables. It is also often used for simple prediction, by estimating how the dependent variable changes when the independent variables change. Before a regression model can be interpreted, several classical assumptions must be checked: normality of the residuals, absence of multicollinearity, homogeneity of variance (homoscedasticity), and linearity.
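For reference, a multiple linear regression with k independent variables can be written as

\[
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i
\]

where the residuals \(\varepsilon_i\) are what the classical assumption tests below are applied to.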

Build Model

As a first step, I build two regression models from the simulated data. The first model uses Sales Revenue as the dependent variable and Weather Conditions, Price, and Promotion Spending as independent variables, including an interaction between Weather Conditions and Price.

simulated_data$Weather_Conditions <- as.factor(simulated_data$Weather_Conditions)
model1 <- lm(Sales_Revenue ~ Weather_Conditions*Price + Promotion_Spending, data = simulated_data)

The second model also uses Sales Revenue as the dependent variable, with Price and Promotion Spending as independent variables. Unlike the first model, the second model contains no categorical independent variable.

model2 <- lm(Sales_Revenue ~ Price + Promotion_Spending, data = simulated_data)

Before interpreting the results of the fitted regression models, I test the classical assumptions for both of them.

Normality

Hypotheses for the residual normality test:
H0 : The residuals are normally distributed
H1 : The residuals are not normally distributed
Reject the null hypothesis if the p-value is less than the significance level (0.05 or 0.01)

lillie.test(model1$residuals)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  model1$residuals
## D = 0.059853, p-value = 0.5089

Based on the Lilliefors (Kolmogorov-Smirnov) normality test, the first model has a p-value greater than 0.05, so we fail to reject H0 and the residuals can be considered normally distributed.

lillie.test(model2$residuals)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  model2$residuals
## D = 0.061778, p-value = 0.4571

Based on the same normality test, the second model also has a p-value greater than 0.05, so its residuals can likewise be considered normally distributed.

Multicollinearity

The VIF (Variance Inflation Factor) test indicates whether multicollinearity is present in a multiple regression model. As a rule of thumb, a regression model free of multicollinearity problems has VIF values smaller than 10.
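For reference, the VIF of the j-th independent variable is computed from the \(R_j^2\) obtained by regressing that variable on all the other independent variables:

\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\]

so a VIF of 10 corresponds to \(R_j^2 = 0.9\).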

vif(model1)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                                GVIF Df GVIF^(1/(2*Df))
## Weather_Conditions       816.200778  2        5.345019
## Price                      3.591664  1        1.895168
## Promotion_Spending         1.102208  1        1.049861
## Weather_Conditions:Price 970.554758  2        5.581552

In the first model, the raw GVIF values for Weather_Conditions and the Weather_Conditions:Price interaction are far above 10, but this is expected: an interaction term is by construction correlated with its main effects, which is why vif() suggests using type = 'predictor' here. For terms with more than one degree of freedom the scaled GVIF^(1/(2*Df)) is the more appropriate measure, and Price and Promotion_Spending themselves show no problematic inflation. The high values therefore mainly reflect the structural correlation introduced by the interaction rather than a serious multicollinearity problem among the original predictors.

vif(model2)
##              Price Promotion_Spending 
##           1.000705           1.000705

Based on the VIF test for the second model, each independent variable has a VIF value well below 10, so the second regression model shows no multicollinearity problem.

Homogeneity

Hypotheses for the homogeneity (homoscedasticity) test:
H0 : The residuals have homogeneous (equal) variance
H1 : The residuals do not have equal variance
Reject the null hypothesis if the p-value is less than the significance level (0.05 or 0.01)

bptest(model1, studentize = FALSE)
## 
##  Breusch-Pagan test
## 
## data:  model1
## BP = 9.9747, df = 6, p-value = 0.1257

In the first model, the p-value is greater than 0.05, so the residuals can be considered to have equal variance.

bptest(model2, studentize = FALSE)
## 
##  Breusch-Pagan test
## 
## data:  model2
## BP = 6.0613, df = 2, p-value = 0.04828

In the second model, the p-value is smaller than 0.05, so the null hypothesis is rejected: the residuals do not have equal variance. When this happens, one common remedy is to transform the model, for example by taking the logarithm of the dependent variable.

model2trans <- lm(log(Sales_Revenue) ~ Price + Promotion_Spending, data = simulated_data)
bptest(model2trans)
## 
##  studentized Breusch-Pagan test
## 
## data:  model2trans
## BP = 5.908, df = 2, p-value = 0.05213

After transforming the model, the p-value is just above 0.05, so the residuals of the transformed model can be considered to have equal variance.

Linearity

Hypotheses for the linearity test:
H0 : The independent variables have a linear relationship with the dependent variable
H1 : The independent variables do not have a linear relationship with the dependent variable
Reject the null hypothesis if the p-value is less than the significance level (0.05 or 0.01)

raintest(model1, fraction = 0.5, order.by = NULL, center = NULL, data = list())
## 
##  Rainbow test
## 
## data:  model1
## Rain = 0.78604, df1 = 50, df2 = 43, p-value = 0.7947

In the first model, the p-value is greater than 0.05, which means the independent variables have a linear relationship with the dependent variable.

raintest(model2trans, fraction = 0.5, order.by = NULL, center = NULL, data = list())
## 
##  Rainbow test
## 
## data:  model2trans
## Rain = 0.78335, df1 = 50, df2 = 47, p-value = 0.8019

In the transformed second model, the p-value is greater than 0.05, which means the independent variables have a linear relationship with the dependent variable.

Final Model

After all the classical assumption tests are satisfied, I evaluate the models using MAPE (Mean Absolute Percentage Error). The regression model with the smallest MAPE is taken as the best model.
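For reference, the MAPE computed in the code below uses the residuals and fitted values of each model:

\[
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{\hat{y}_i} \right|
\]

(the textbook definition divides by the actual value \(y_i\) rather than the fitted value \(\hat{y}_i\); when the residuals are small the two are very close).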

model1MAPE <- mean(abs(model1$residuals/model1$fitted.values))*100            # on the original scale
model2MAPE <- mean(abs(model2trans$residuals/model2trans$fitted.values))*100  # on the log scale

data.frame(Model = c(1:2),
           MAPE = c(model1MAPE, model2MAPE))

Of the two regression models, the log-transformed second model has the smallest MAPE, at about 1.54%, so it is the model chosen for prediction. (Note that this MAPE is computed on the log scale, so the comparison with the first model is only indicative.)

summary(model2trans)
## 
## Call:
## lm(formula = log(Sales_Revenue) ~ Price + Promotion_Spending, 
##     data = simulated_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42418 -0.06854  0.01215  0.08605  0.29253 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.866e+00  8.215e-02  83.583   <2e-16 ***
## Price              -6.086e-04  1.442e-03  -0.422   0.6738    
## Promotion_Spending  2.158e-05  1.220e-05   1.768   0.0801 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1383 on 97 degrees of freedom
## Multiple R-squared:  0.03336,    Adjusted R-squared:  0.01343 
## F-statistic: 1.674 on 2 and 97 DF,  p-value: 0.1929

From the log-transformed second regression model we obtain:
1. Regression equation: log(Y) = 6.866e+00 - 6.086e-04*Price + 2.158e-05*Promotion_Spending, where Y is Sales_Revenue (see the back-transformation sketch below).
2. F-statistic = 1.674 with a p-value greater than 0.05, which means H0 is not rejected: none of the independent variables has a significant effect on the dependent variable.
3. R-squared = 0.033, which means the independent variables explain only 3.3% of the variance of the dependent variable; the remaining 96.7% is explained by other factors not included in the regression model.
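Because the chosen model is fit on log(Sales_Revenue), any prediction it produces is on the log scale and must be exponentiated back to the original scale. A minimal sketch, with made-up illustrative values for the predictors:

new_data <- tibble(Price = 55, Promotion_Spending = 3000)  # hypothetical new observation
log_pred <- predict(model2trans, newdata = new_data)       # prediction of log(Sales_Revenue)
exp(log_pred)                                              # back-transform to the original scale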


Because none of the independent variables in the second regression model has a significant effect on the dependent variable, the regression model is not effective enough to forecast Sales Revenue even though its MAPE is small.

Time Series

The ARIMA model combines an AR (Autoregressive) model and an MA (Moving Average) model and is suitable for stationary time series data. It is generally denoted ARIMA(p, d, q), where p is the order of the AR process, d is the degree of differencing, and q is the order of the MA process. Forecasting with ARIMA has been popular since the early 1970s, when it was introduced by Box and Jenkins, because it captures the information contained in time series data well. Here, I will forecast Sales Revenue with the ARIMA method.
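For reference, after differencing the series d times, the resulting ARMA(p, q) process for the (differenced) series \(y_t\) can be written as

\[
y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}
\]

where \(\phi_i\) are the AR coefficients, \(\theta_j\) the MA coefficients, and \(\varepsilon_t\) is white noise.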

Data Preparation

Before forecasting, R needs to recognize the data as a time series; this is done by converting the dated observations into a time series object with the ts() function. The series is then plotted to see the patterns it contains. In this case, the variable to be forecast is Sales Revenue.

n <- simulated_data[,c(5)]                       # Sales_Revenue column (note: this overwrites the earlier n)
n <- ts(n, start = c(2022, 1), frequency = 365)  # daily series starting 1 January 2022
fig <- plot_ly(simulated_data, x = ~Date, y = ~Sales_Revenue, type = 'scatter', mode = 'lines')

fig

At first glance, the plot does not show a marked change in the spread of the series, so the data can be said to be stationary in variance. However, the series is not stationary in mean. To handle this, the time series needs to be differenced.
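First differencing replaces each observation with its change from the previous one:

\[
y'_t = y_t - y_{t-1}
\]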

Stationary Test

Another way to check whether a time series is stationary is the Augmented Dickey-Fuller (ADF) test.


The hypotheses of the ADF stationarity test are:
H0 : The data are non-stationary
H1 : The data are stationary
Reject the null hypothesis if the p-value is less than the significance level (0.05 or 0.01)
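The three "types" reported by adf.test() below correspond, roughly, to three versions of the test regression:

Type 1 (no drift, no trend): \( \Delta y_t = \gamma y_{t-1} + \sum_{i=1}^{k} \delta_i \Delta y_{t-i} + \varepsilon_t \)
Type 2 (with drift, no trend): \( \Delta y_t = \alpha + \gamma y_{t-1} + \sum_{i=1}^{k} \delta_i \Delta y_{t-i} + \varepsilon_t \)
Type 3 (with drift and trend): \( \Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{k} \delta_i \Delta y_{t-i} + \varepsilon_t \)

In each case H0 is \(\gamma = 0\), i.e. the series has a unit root and is non-stationary.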

adf.test(n)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag    ADF p.value
## [1,]   0 -0.926   0.346
## [2,]   1 -0.602   0.463
## [3,]   2 -0.510   0.496
## [4,]   3 -0.379   0.534
## [5,]   4 -0.488   0.503
## Type 2: with drift no trend 
##      lag   ADF p.value
## [1,]   0 -7.12  0.0100
## [2,]   1 -4.77  0.0100
## [3,]   2 -3.43  0.0135
## [4,]   3 -2.51  0.1310
## [5,]   4 -2.48  0.1417
## Type 3: with drift and trend 
##      lag   ADF p.value
## [1,]   0 -7.32  0.0100
## [2,]   1 -4.98  0.0100
## [3,]   2 -3.59  0.0378
## [4,]   3 -2.64  0.3102
## [5,]   4 -2.55  0.3445
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

Focusing on the Type 1 results, every lag has a p-value greater than 0.05 (and 0.01), meaning we fail to reject the null hypothesis: the data are not stationary.


So, the time series needs to be differenced once so that the data become stationary.

adf.test(diff(n))
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag    ADF p.value
## [1,]   0 -16.02    0.01
## [2,]   1 -12.06    0.01
## [3,]   2 -10.30    0.01
## [4,]   3  -7.38    0.01
## Type 2: with drift no trend 
##      lag    ADF p.value
## [1,]   0 -15.94    0.01
## [2,]   1 -12.00    0.01
## [3,]   2 -10.24    0.01
## [4,]   3  -7.35    0.01
## Type 3: with drift and trend 
##      lag    ADF p.value
## [1,]   0 -15.86    0.01
## [2,]   1 -11.94    0.01
## [3,]   2 -10.18    0.01
## [4,]   3  -7.31    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

From the ADF test on the first-differenced data, every lag has a p-value smaller than 0.05 (and 0.01), which means the differenced data are stationary.

AR and MA

Next, the orders for AR (Autoregressive) and MA (Moving Average) can be determined from the ACF and PACF plots, using the following criteria:
1. For an AR(p) process, the ACF plot decays exponentially or follows a sine wave, while the PACF plot cuts off after lag p.
2. For an MA(q) process, the ACF plot cuts off after lag q, while the PACF plot decays exponentially or follows a sine wave.

par(mfrow = c(2, 1))  # stack the ACF and PACF plots
acf(diff(n), lag.max = 24)

pacf(diff(n), lag.max = 24)

From the ACF and PACF plots, the ACF cuts off at lag 2 and the PACF cuts off at lag 4. So the maximum AR order to consider is 4 and the maximum MA order is 2.
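As a minimal sketch (not part of the original workflow), the candidate orders read off the plots could also be compared manually by AIC, fitting each ARIMA(p, 1, q) with p ≤ 4 and q ≤ 2 on the once-differenced series; auto.arima() in the next step performs a similar search automatically and also chooses d itself.

candidates <- expand.grid(p = 0:4, q = 0:2)
candidates$AIC <- apply(candidates, 1, function(ord) {
  # fit ARIMA(p, 1, q); return NA if the fit fails
  tryCatch(AIC(Arima(n, order = c(ord["p"], 1, ord["q"]))),
           error = function(e) NA)
})
candidates[order(candidates$AIC), ]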

Build Model

Once the stationarity requirement is satisfied, we can proceed to build the ARIMA model. In this case, I use the auto.arima() function to search for the best ARIMA model.

auto.arima(n, trace = TRUE)
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)             with non-zero mean : Inf
##  ARIMA(0,0,0)             with non-zero mean : 1267.239
##  ARIMA(1,0,0)             with non-zero mean : 1260.957
##  ARIMA(0,0,1)             with non-zero mean : 1262.438
##  ARIMA(0,0,0)             with zero mean     : 1669.639
##  ARIMA(2,0,0)             with non-zero mean : 1261.118
##  ARIMA(1,0,1)             with non-zero mean : 1251.499
##  ARIMA(2,0,1)             with non-zero mean : 1255.836
##  ARIMA(1,0,2)             with non-zero mean : 1253.492
##  ARIMA(0,0,2)             with non-zero mean : 1262.81
##  ARIMA(1,0,1)             with zero mean     : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(1,0,1)             with non-zero mean : 1251.466
## 
##  Best model: ARIMA(1,0,1)             with non-zero mean
## Series: n 
## ARIMA(1,0,1) with non-zero mean 
## 
## Coefficients:
##          ar1      ma1      mean
##       0.9198  -0.7377  1001.994
## s.e.  0.0535   0.0803    36.507
## 
## sigma^2 = 15050:  log likelihood = -621.52
## AIC=1251.04   AICc=1251.47   BIC=1261.47

From the results of auto.arima(), the best ARIMA model is ARIMA(1,0,1).

bestmodel <- Arima(n, order = c(1,0,1))
coeftest(bestmodel)
## 
## z test of coefficients:
## 
##              Estimate  Std. Error z value  Pr(>|z|)    
## ar1          0.919829    0.053516 17.1881 < 2.2e-16 ***
## ma1         -0.737705    0.080294 -9.1876 < 2.2e-16 ***
## intercept 1001.994294   36.506961 27.4467 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Forecast

Next, the ARIMA(1,0,1) model is applied to forecast Sales Revenue for April 11, 2022, i.e. one day ahead.

accuracy(bestmodel)
##                     ME     RMSE      MAE      MPE     MAPE MASE       ACF1
## Training set -1.929319 120.8257 94.75599 -1.80019 9.983429  NaN -0.0317043
forecast(bestmodel, h=1)
##           Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2022.2740       957.2805 800.0598 1114.501 716.8322 1197.729
plot(forecast(bestmodel, h=1))

From the forecast using ARIMA(1,0,1), the predicted Sales Revenue on April 11, 2022 is 957.28 (USD), with a 95% confidence interval of 716.83 to 1197.73 and a MAPE (Mean Absolute Percentage Error) of 9.98%.

Summary

Based on the regression analysis, the model built is not effective enough to forecast Sales Revenue. For the ARIMA analysis, the best model is ARIMA(1,0,1) with a MAPE of 9.98%.


The Sales Revenue forecast for the next day, April 11, 2022, lies in the range of 716.83 to 1197.73, with a point forecast of 957.28 (USD).