Econometrics
~ Meeting 12 ~
| Contact | |
| Email | diyasaryanugroho@gmail.com |
| Instagram | https://www.instagram.com/diasary_nm/ |
| RPubs | https://rpubs.com/diyasarya/ |
library(plotly)     # interactive plots
library(tidyverse)  # data wrangling
library(lubridate)  # date handling
library(aTSA)       # adf.test
library(stats)
library(lmtest)     # bptest, raintest, coeftest
library(forecast)   # auto.arima, forecast, accuracy
library(car)        # vif
library(nortest)    # lillie.test

Simulated Data
set.seed(123)
n <- 100

# Daily dates from 2022-01-01 to 2022-04-10 (100 days)
start_date <- ymd("2022-01-01")
end_date <- ymd("2022-04-10")
dates <- seq(start_date, end_date, by = "day")

# Simulated predictors
promotion_spending <- runif(n, min = 1000, max = 5000)
price <- rnorm(n, mean = 50, sd = 10)
weather_conditions <- sample(c("sunny", "cloudy", "rainy"), size = n, replace = TRUE)

# Sales revenue = baseline + linear trend + sinusoidal seasonality + noise
sales_trend <- 0.1*seq(1, n)
seasonal_pattern <- sin(seq(1, n)*2*pi/365*7)*100
sales_noise <- rnorm(n, mean = 0, sd = 100)
sales_revenue <- 1000 + sales_trend + seasonal_pattern + sales_noise

simulated_data <- tibble(Date = dates,
                         Promotion_Spending = promotion_spending,
                         Price = price,
                         Weather_Conditions = weather_conditions,
                         Sales_Revenue = sales_revenue)
head(simulated_data)

Regression
Regression is a statistical method for estimating the relationship between a dependent variable and one or more independent variables. It is also often used for simple prediction, by estimating how the dependent variable changes when an independent variable changes. Before a regression model can be interpreted, the classical assumptions must be satisfied: normality, no multicollinearity, homogeneity of variance (homoscedasticity), and linearity.
Build Model
As a first step, I built two regression models from the simulated data. The first uses Sales_Revenue as the dependent variable and Weather_Conditions, Price, and Promotion_Spending as independent variables, including an interaction between Weather_Conditions and Price.
simulated_data$Weather_Conditions <- as.factor(simulated_data$Weather_Conditions)
model1 <- lm(Sales_Revenue ~ Weather_Conditions*Price + Promotion_Spending, data = simulated_data)

My second model also uses Sales_Revenue as the dependent variable, with Price and Promotion_Spending as the independent variables. Unlike the first model, the second model has no categorical independent variables.

model2 <- lm(Sales_Revenue ~ Price + Promotion_Spending, data = simulated_data)

Before interpreting the results of the two regression models, I first test the classical assumptions for both.
Normality
Test the normality hypothesis of the residuals:
H0 : The residuals are normally distributed
H1 : The residuals are not normally distributed
Reject the null hypothesis if the p-value is less than 0.05 (or 0.01).
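The test calls are not echoed in the output; a minimal sketch, assuming nortest's lillie.test() is applied to each model's residuals:

lillie.test(model1$residuals)
lillie.test(model2$residuals)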
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: model1$residuals
## D = 0.059853, p-value = 0.5089
Based on the Lilliefors (Kolmogorov-Smirnov) normality test, the first model's p-value (0.5089) is greater than 0.05, so we fail to reject H0: the residuals are normally distributed.
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: model2$residuals
## D = 0.061778, p-value = 0.4571
For the second model, the p-value (0.4571) is also greater than 0.05, so its residuals are normally distributed as well.
Multicollinearity
The VIF (Variance Inflation Factor) test gives more precise information about the presence or absence of multicollinearity in a multiple regression model. As a rule of thumb, a model free of multicollinearity has VIF values smaller than 10.
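The test calls are not shown; a minimal sketch, assuming car's vif() is applied to each model:

vif(model1)
vif(model2)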
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## Weather_Conditions 816.200778 2 5.345019
## Price 3.591664 1 1.895168
## Promotion_Spending 1.102208 1 1.049861
## Weather_Conditions:Price 970.554758 2 5.581552
For the first model, vif() reports generalized VIF (GVIF) values because the model contains a factor and an interaction term, hence the warning above. The very large GVIF for Weather_Conditions and Weather_Conditions:Price is expected, since an interaction term is inherently correlated with its main effects; Price and Promotion_Spending have values well below 10, so the separate predictors of the first model show no problematic multicollinearity.
## Price Promotion_Spending
## 1.000705 1.000705
Based on the VIF test for the second model, each independent variable has a VIF value close to 1, far below 10, so the second regression model shows no multicollinearity problem.
Homogeneity
Test the homogeneity-of-variance hypothesis of the residuals:
H0 : The residuals have homogeneous (equal) variance
H1 : The residuals do not have equal variance
Reject the null hypothesis if the p-value is less than 0.05 (or 0.01).
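A minimal sketch of the test calls, assuming lmtest's bptest() with studentize = FALSE, which matches the "Breusch-Pagan test" headers printed below:

bptest(model1, studentize = FALSE)
bptest(model2, studentize = FALSE)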
##
## Breusch-Pagan test
##
## data: model1
## BP = 9.9747, df = 6, p-value = 0.1257
In the first model, the p-value (0.1257) is greater than 0.05, so we fail to reject H0: the residuals have equal variance.
##
## Breusch-Pagan test
##
## data: model2
## BP = 6.0613, df = 2, p-value = 0.04828
In the second model, the p-value (0.04828) is smaller than 0.05, so we reject the null hypothesis: the residuals do not have equal variance. One common remedy is to transform the dependent variable, for example with a log transformation.
model2trans <- lm(log(Sales_Revenue) ~ Price + Promotion_Spending, data = simulated_data)
bptest(model2trans)
##
## studentized Breusch-Pagan test
##
## data: model2trans
## BP = 5.908, df = 2, p-value = 0.05213
After the transformation, the p-value (0.05213) is greater than 0.05, so the residuals of the transformed model can be treated as having equal variance.
Linearity
Test the linearity hypothesis:
H0 : The independent variables have a linear relationship with the dependent variable
H1 : The independent variables have no linear relationship with the dependent variable
Reject the null hypothesis if the p-value is less than 0.05 (or 0.01).
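A minimal sketch of the test calls, assuming the Rainbow test from lmtest:

raintest(model1)
raintest(model2trans)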
##
## Rainbow test
##
## data: model1
## Rain = 0.78604, df1 = 50, df2 = 43, p-value = 0.7947
In the first model, the p-value (0.7947) is greater than 0.05, so the independent variables have a linear relationship with the dependent variable.
##
## Rainbow test
##
## data: model2trans
## Rain = 0.78335, df1 = 50, df2 = 47, p-value = 0.8019
In the transformed second model, the p-value (0.8019) is likewise greater than 0.05, so its independent variables also have a linear relationship with the dependent variable.
Final Model
With all the classical assumption tests satisfied, I evaluate the models using MAPE (Mean Absolute Percentage Error). The regression model with the smallest MAPE is the best model.
model1MAPE <- mean(abs(model1$residuals/model1$fitted.values))*100
# Note: the residuals and fitted values of model2trans are on the log scale
model2MAPE <- mean(abs(model2trans$residuals/model2trans$fitted.values))*100
data.frame(Model = c(1:2),
           MAPE = c(model1MAPE, model2MAPE))

Of the two regression models, the log-transformed second model has the smallest MAPE, at 1.54% (computed on the log scale, so it is not directly comparable to model 1's raw-scale MAPE). The log-transformed model is therefore chosen for prediction.
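The model summary below is presumably produced by:

summary(model2trans)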
##
## Call:
## lm(formula = log(Sales_Revenue) ~ Price + Promotion_Spending,
## data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42418 -0.06854 0.01215 0.08605 0.29253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.866e+00 8.215e-02 83.583 <2e-16 ***
## Price -6.086e-04 1.442e-03 -0.422 0.6738
## Promotion_Spending 2.158e-05 1.220e-05 1.768 0.0801 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1383 on 97 degrees of freedom
## Multiple R-squared: 0.03336, Adjusted R-squared: 0.01343
## F-statistic: 1.674 on 2 and 97 DF, p-value: 0.1929
From the log-transformed second regression model, we obtain:
1. Regression equation: log(Y) = 6.866e+00 - 6.086e-04*Price + 2.158e-05*Promotion_Spending, where Y = Sales_Revenue.
2. F-statistic = 1.674 with a p-value greater than 0.05, so H0 is accepted: no independent variable has a significant effect on the dependent variable.
3. R-squared = 0.033, meaning the independent variables explain 3.3% of the variance of the dependent variable; the remaining 96.7% is explained by other factors not included in the regression model.
Because no independent variable in the second regression model has a significant effect on the dependent variable, the model is not effective enough to forecast Sales Revenue, even though its MAPE is small.
Time Series
The ARIMA model is a combination of the AR (Autoregressive) and MA (Moving Average) models and can describe stationary time series data. It is generally denoted ARIMA(p, d, q), where p is the order of the AR process, d is the degree of differencing, and q is the order of the MA process. ARIMA forecasting has been popular since the early 1970s, when it was introduced by Box and Jenkins, because it captures the information in time series data well. This time, I will forecast Sales Revenue with the ARIMA method.
Data Preparation
Before forecasting, the data must be recognized by R as a time series, by converting the date-indexed values with the ts() function. Then, plot the series to inspect the patterns it contains. In this case, we will forecast Sales Revenue.
n <- simulated_data[,c(5)]  # the Sales_Revenue column
n <- ts(n, start = c(2022, 1), frequency = 365)
fig <- plot_ly(simulated_data, x = ~Date, y = ~Sales_Revenue, type = 'scatter', mode = 'lines')
fig

At first glance, the plot shows no dramatic change in the spread of the series, so the data can be considered stationary in variance. However, the series is not stationary in mean. To handle this, the time series needs to be differenced.
Stationarity Test

Another way to prove whether a time series is stationary is the Augmented Dickey-Fuller (ADF) test. Its hypotheses are:
H0 : The data are non-stationary
H1 : The data are stationary
Reject the null hypothesis if the p-value is less than 0.05 (or 0.01).
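A minimal sketch of the test call, assuming aTSA's adf.test() on the series:

adf.test(n)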
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -0.926 0.346
## [2,] 1 -0.602 0.463
## [3,] 2 -0.510 0.496
## [4,] 3 -0.379 0.534
## [5,] 4 -0.488 0.503
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -7.12 0.0100
## [2,] 1 -4.77 0.0100
## [3,] 2 -3.43 0.0135
## [4,] 3 -2.51 0.1310
## [5,] 4 -2.48 0.1417
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -7.32 0.0100
## [2,] 1 -4.98 0.0100
## [3,] 2 -3.59 0.0378
## [4,] 3 -2.64 0.3102
## [5,] 4 -2.55 0.3445
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
Focusing on the Type 1 ADF results, every lag has a p-value greater than 0.05 (and 0.01), so we fail to reject the null hypothesis: the data are not stationary. The series therefore needs to be first-differenced to make it stationary.
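A minimal sketch, assuming base R's diff() to take the first difference before re-testing:

adf.test(diff(n))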
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -16.02 0.01
## [2,] 1 -12.06 0.01
## [3,] 2 -10.30 0.01
## [4,] 3 -7.38 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -15.94 0.01
## [2,] 1 -12.00 0.01
## [3,] 2 -10.24 0.01
## [4,] 3 -7.35 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -15.86 0.01
## [2,] 1 -11.94 0.01
## [3,] 2 -10.18 0.01
## [4,] 3 -7.31 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
From the ADF test on the first-differenced series, every lag has a p-value smaller than 0.05 (and 0.01), which means the differenced data are stationary.
AR and MA
Next, the orders for AR (Autoregressive) and MA (Moving Average) can be determined from the ACF and PACF plots (see the sketch below). The criteria are as follows:
1. An AR(p) process shows an ACF that decays exponentially or follows a sine wave, while its PACF cuts off after lag p.
2. An MA(q) process shows an ACF that cuts off after lag q, while its PACF decays exponentially or follows a sine wave.
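The plotting chunk is not echoed; a minimal sketch, assuming the ACF and PACF are drawn for the differenced series:

par(mfrow = c(2, 1))  # stack the two plots (mfrow, not nfrow)
acf(diff(n))          # ACF: suggests the maximum MA order q
pacf(diff(n))         # PACF: suggests the maximum AR order p
par(mfrow = c(1, 1))  # restore the default layout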
From the ACF and PACF plots, the ACF cuts off at lag 2 and the PACF cuts off at lag 4. So the maximum AR order to consider is 4, and the maximum MA order is 2.
Build Model
Once stationarity is satisfied, we can proceed to build the ARIMA model. In this case, I used the auto.arima() function to search for the best possible ARIMA model.
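A minimal sketch, assuming forecast's auto.arima() with trace = TRUE to print the model search below; arima_model is a hypothetical name:

arima_model <- auto.arima(n, trace = TRUE)
arima_model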
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(0,0,0) with non-zero mean : 1267.239
## ARIMA(1,0,0) with non-zero mean : 1260.957
## ARIMA(0,0,1) with non-zero mean : 1262.438
## ARIMA(0,0,0) with zero mean : 1669.639
## ARIMA(2,0,0) with non-zero mean : 1261.118
## ARIMA(1,0,1) with non-zero mean : 1251.499
## ARIMA(2,0,1) with non-zero mean : 1255.836
## ARIMA(1,0,2) with non-zero mean : 1253.492
## ARIMA(0,0,2) with non-zero mean : 1262.81
## ARIMA(1,0,1) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(1,0,1) with non-zero mean : 1251.466
##
## Best model: ARIMA(1,0,1) with non-zero mean
## Series: n
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
## ar1 ma1 mean
## 0.9198 -0.7377 1001.994
## s.e. 0.0535 0.0803 36.507
##
## sigma^2 = 15050: log likelihood = -621.52
## AIC=1251.04 AICc=1251.47 BIC=1261.47
From the results of auto.arima(), the best ARIMA model is ARIMA(1,0,1).
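The z test of coefficients below is presumably lmtest's coeftest() applied to the fitted model (using the hypothetical arima_model from above):

coeftest(arima_model)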
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.919829 0.053516 17.1881 < 2.2e-16 ***
## ma1 -0.737705 0.080294 -9.1876 < 2.2e-16 ***
## intercept 1001.994294 36.506961 27.4467 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Forecast
Since all coefficients (ar1, ma1, intercept) are significant at the 0.05 level, I apply the ARIMA(1,0,1) model to forecast Sales Revenue for April 11, 2022, that is, one day ahead.
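A minimal sketch, assuming forecast's accuracy() and forecast() with horizon h = 1 (again using the hypothetical arima_model):

accuracy(arima_model)
forecast(arima_model, h = 1)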
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -1.929319 120.8257 94.75599 -1.80019 9.983429 NaN -0.0317043
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2022.2740 957.2805 800.0598 1114.501 716.8322 1197.729
From the forecast using ARIMA(1,0,1), the Sales Revenue forecast for April 11, 2022 is 957.28 (USD), with a 95% prediction interval of 716.83 to 1197.73 and a MAPE (Mean Absolute Percentage Error) of 9.98%.
Summary
Based on the regression analysis, the model built is not effective enough to forecast Sales Revenue. For the ARIMA analysis, the best model is ARIMA(1,0,1) with a MAPE of 9.98%. The Sales Revenue forecast for the next day, April 11, 2022, is in the range of 716.83 to 1197.73, with a point forecast of 957.28 (USD).