Assignment 3: Econometrics

Meeting 12

Naftali Brigitta Gunawan

May 06, 2024

Contact	\(\downarrow\)
Email	naftaligunawan@gmail.com
Instagram	https://www.instagram.com/nbrigittag/
RPubs	https://rpubs.com/naftalibrigitta/
Name	Naftali Brigitta Gunawan
NIM (Student ID)	20214920002

Dataset

The dataset represents historical daily sales data, including information on sales revenue, promotional activities, pricing strategies, and weather conditions. For this assignment the data are simulated: 100 daily observations from 2022-01-01 to 2022-04-10.

The dataset contains the following variables:

Date: Date of sale
Promotional_Spending: Amount of promotional spending
Price: Product price
Weather_Conditions: Weather condition (sunny, cloudy, or rainy)
Sales_Revenue: Total sales revenue

# Load required libraries
library(tidyverse)
library(lubridate)

# Set seed for reproducibility
set.seed(123)

# Number of observations
n <- 100

# Simulate date range
start_date <- ymd("2022-01-01")
end_date <- ymd("2022-04-10")
dates <- seq(start_date, end_date, by = "day")

# Simulate predictor variables
promotional_spending <- runif(n, min = 1000, max = 5000)
price <- rnorm(n, mean = 50, sd = 10)
weather_conditions <- sample(c("sunny", "cloudy", "rainy"), size = n, replace = TRUE)

# Simulate sales revenue
sales_trend <- 0.1 * seq(1, n)
seasonal_pattern <- sin(seq(1, n) * 2 * pi / 365 * 7) * 100
sales_noise <- rnorm(n, mean = 0, sd = 100)
sales_revenue <- 1000 + sales_trend + seasonal_pattern + sales_noise

# Create dataframe
simulated_data <- tibble(
  Date = dates,
  Promotional_Spending = promotional_spending,
  Price = price,
  Weather_Conditions = weather_conditions,
  Sales_Revenue = sales_revenue
)

# Display the first few rows of the dataset
simulated_data
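
For reference, the data-generating process implied by the simulation code above can be written as follows. The index t = 1, ..., 100 (one value per day) and the noise symbol are notation introduced here, not names from the original code.

```latex
\[
\text{Sales\_Revenue}_t = 1000 + 0.1\,t + 100 \sin\!\left(\frac{2\pi \cdot 7\,t}{365}\right) + \varepsilon_t,
\qquad \varepsilon_t \sim N(0,\ 100^2)
\]
```

Promotional_Spending, Price, and Weather_Conditions do not enter this equation, and the seasonal term has a period of 365/7, or about 52.1 days.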

Question 1

Develop a regression model to understand the relationship between sales revenue and various predictors such as promotional spending, pricing, and external factors.

library(dplyr)
sum(is.na(simulated_data))

## [1] 0

The check returns 0, so the dataset contains no missing (NA) values.
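
If missing values were present, a per-column count would show where they occur. The sketch below uses a small hypothetical data frame (`toy`, not part of the assignment data) and base R only.

```r
# Hypothetical toy data frame standing in for simulated_data
toy <- data.frame(
  Price = c(50, NA, 47),
  Weather_Conditions = c("sunny", "rainy", NA),
  Sales_Revenue = c(1000, 1100, 950)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

The same two calls can be applied to `simulated_data` directly.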

# Convert the weather variable to a factor so lm() treats it as categorical
simulated_data <- simulated_data %>%
  mutate(Weather_Conditions = as.factor(Weather_Conditions))

# Fit a multiple linear regression with lm()
reg_model <- lm(Sales_Revenue ~ Promotional_Spending + Price + Weather_Conditions, data = simulated_data)
summary(reg_model)

## 
## Call:
## lm(formula = Sales_Revenue ~ Promotional_Spending + Price + Weather_Conditions, 
##     data = simulated_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -340.29  -71.58   11.21   82.72  315.61 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             992.92772   81.16696  12.233   <2e-16 ***
## Promotional_Spending      0.01922    0.01197   1.605    0.112    
## Price                    -0.64144    1.42687  -0.450    0.654    
## Weather_Conditionsrainy -22.74556   36.17645  -0.629    0.531    
## Weather_Conditionssunny -23.58060   32.57721  -0.724    0.471    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.7 on 95 degrees of freedom
## Multiple R-squared:  0.03842,    Adjusted R-squared:  -0.002062 
## F-statistic: 0.9491 on 4 and 95 DF,  p-value: 0.4393

Output Result:

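The p-value of the overall F-test (0.4393) is above 0.05, so we cannot reject the null hypothesis that all slope coefficients are zero. None of the individual predictors is significant at the 5% level, and the model explains only about 3.8% of the variance in sales revenue (the adjusted R-squared is slightly negative). This is consistent with how the data were simulated: sales revenue depends only on a time trend, a seasonal pattern, and random noise, not on promotional spending, price, or weather.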
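
The overall p-value and the adjusted R-squared in the summary can be reproduced from the printed F-statistic, degrees of freedom, and R-squared. This is a minimal check using the rounded values from the output above, so the results agree only up to rounding; the variable names are chosen here for illustration.

```r
# Values copied from the summary(reg_model) output (rounded as printed)
f_stat   <- 0.9491   # F-statistic
df_model <- 4        # numerator degrees of freedom (number of slope terms)
df_resid <- 95       # residual degrees of freedom
r2       <- 0.03842  # multiple R-squared
n_obs    <- 100      # number of observations

# Upper-tail probability of the F distribution gives the overall p-value
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / df_resid

print(c(p_value = p_value, adj_r2 = adj_r2))
```

A negative adjusted R-squared means the predictors explain less variance than would be expected by chance for a model of this size.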

# Load required libraries
library(ggplot2)
library(plotly)  # Ensure plotly is loaded
library(gridExtra)
library(cowplot)

# Predict sales revenue using the regression model
simulated_data$Predicted_Sales <- predict(reg_model)

# Create Residual vs Fitted plot using ggplot2
residual_plot <- ggplot(simulated_data, aes(x = Predicted_Sales, y = resid(reg_model))) +
  geom_point(color = "green") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
  theme_minimal()

# Convert to plotly
residual_plotly <- ggplotly(residual_plot)

# Create Distribution of Residuals plot using ggplot2
residual_distribution <- ggplot(simulated_data, aes(x = resid(reg_model))) +
  geom_histogram(binwidth = 100, fill = "orange", color = "yellow") +
  labs(x = "Residuals", y = "Frequency", title = "Distribution of Residuals") +
  theme_minimal()

# Convert to plotly
residual_distribution_plotly <- ggplotly(residual_distribution)

# Create QQ Plot using ggplot2
qq_plot <- ggplot(simulated_data, aes(sample = resid(reg_model))) +
  stat_qq(color = "red") +
  stat_qq_line(color = "blue") +
  labs(title = "QQ Plot of Residuals") +
  theme_minimal()

# Convert to plotly
qq_plot_plotly <- ggplotly(qq_plot)

# Display all plots (One way to do it is to use subplot from plotly to put them together)
final_plot <- subplot(residual_plotly, residual_distribution_plotly, qq_plot_plotly, 
                      nrows = 2, margin = 0.05)
final_plot

Question 2

Build a time series model to capture the temporal patterns and trends in sales revenue, accounting for seasonality and other time-related effects.

# Load required libraries
library(plotly)

# Create a ts object for time series analysis
sales_ts <- ts(simulated_data$Sales_Revenue, frequency = 365)

# Create a dataframe from the ts object for easier use with plotly
ts_data <- data.frame(Time = 1:length(sales_ts), Sales_Revenue = as.numeric(sales_ts))

# Create an interactive plotly plot
interactive_ts_plot <- plot_ly(data = ts_data, x = ~Time, y = ~Sales_Revenue, type = 'scatter', mode = 'lines',
                               line = list(color = 'blue')) %>%
  layout(title = "Time Series of Sales Revenue",
         xaxis = list(title = "Time"),
         yaxis = list(title = "Sales Revenue"))

# Display the plot
interactive_ts_plot

# Load the forecast library for ARIMA modeling
library(forecast)

# Use auto.arima to find the best model
arima_model <- auto.arima(sales_ts)
summary(arima_model)

## Series: sales_ts 
## ARIMA(1,0,1) with non-zero mean 
## 
## Coefficients:
##          ar1      ma1      mean
##       0.9198  -0.7377  1001.994
## s.e.  0.0535   0.0803    36.507
## 
## sigma^2 = 15050:  log likelihood = -621.52
## AIC=1251.04   AICc=1251.47   BIC=1261.47
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE MASE       ACF1
## Training set -1.929319 120.8257 94.75599 -1.80019 9.983429  NaN -0.0317043

Output Result:

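auto.arima selected a non-seasonal ARIMA(1,0,1) model with a non-zero mean of about 1002. On the training set the RMSE is 120.83, the MAE is 94.76, and the MAPE is about 9.98%, so fitted values deviate from actual sales by roughly 10% on average. The lag-1 autocorrelation of the residuals is close to zero (-0.03), which suggests little first-order autocorrelation remains. The MASE is NaN, most likely because the series was declared with frequency = 365 while only 100 observations are available, so the seasonal naive benchmark used for scaling cannot be computed. For the same reason the model contains no seasonal component.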
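
The NaN reported for MASE in the output above can be illustrated with base R. MASE scales errors by the mean absolute seasonal difference of the training series; the sketch below assumes that scaling uses a lag equal to the declared frequency (an assumption about how the metric is computed, not code taken from the forecast package) and uses a placeholder series `x` rather than the assignment data.

```r
n_obs <- 100               # length of the series in this report
declared_frequency <- 365  # frequency passed to ts() above

# Period of the simulated seasonal term sin(t * 2 * pi / 365 * 7), in days
seasonal_period <- 365 / 7

# Placeholder series of the same length (hypothetical, for illustration only)
set.seed(123)
x <- rnorm(n_obs, mean = 1000, sd = 100)

# Seasonal differences at lag 365 do not exist for a 100-point series
seasonal_diffs <- diff(x, lag = declared_frequency)
mase_scale <- mean(abs(seasonal_diffs))

print(length(seasonal_diffs))  # no seasonal differences are available
print(mase_scale)              # mean of an empty vector is NaN
print(seasonal_period)
```

Because the simulated cycle repeats roughly every 52 days, a series of 100 days covers almost two cycles, whereas a declared frequency of 365 leaves less than one full period to learn from.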

Question 3

Evaluate and compare the performance of both models in forecasting future sales revenue.

# Generate forecasts from the ARIMA model (no regression forecast is produced here)
arima_forecast <- forecast(arima_model, h = 30)  # Forecast for the next 30 periods

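# Calculate evaluation metrics for ARIMA model
# Note: the 30 forecasts are recycled against all 100 in-sample observations,
# so these values are not true out-of-sample errors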
arima_mse <- mean((as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue)^2)
arima_mae <- mean(abs(as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue))
arima_mape <- mean(abs((as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue) / simulated_data$Sales_Revenue)) * 100

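# Calculate evaluation metrics labelled as linear regression
# Note: reg_model is not used here; the last observed value repeated 30 times
# stands in for the prediction, and the MSE line compares that value with itself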
linear_reg_mse <- mean((rep(simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)], 30) - simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)])^2)
linear_reg_mae <- mean(abs(rep(simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)], 30) - simulated_data$Sales_Revenue))
linear_reg_mape <- mean(abs((rep(simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)], 30) - simulated_data$Sales_Revenue) / simulated_data$Sales_Revenue)) * 100

# Print evaluation metrics
cat("Evaluation metrics for ARIMA model:\n")

## Evaluation metrics for ARIMA model:

cat("MSE:", arima_mse, "\n")

## MSE: 18802.16

cat("MAE:", arima_mae, "\n")

## MAE: 109.4466

cat("MAPE:", arima_mape, "%\n\n")

## MAPE: 11.29185 %

cat("Evaluation metrics for Linear Regression model:\n")

## Evaluation metrics for Linear Regression model:

cat("MSE:", linear_reg_mse, "\n")

## MSE: 0

cat("MAE:", linear_reg_mae, "\n")

## MAE: 218.977

cat("MAPE:", linear_reg_mape, "%\n")

## MAPE: 20.61591 %

Output Result:

ARIMA Model:

** MSE (Mean Squared Error): An MSE of 18802.16 corresponds to a root mean squared error of about 137.1 units. Because the code compares the 30 future forecasts with recycled in-sample observations, this value describes how far the forecasts lie from the historical series rather than a true out-of-sample error.

** MAE (Mean Absolute Error): An MAE of 109.4466 means the ARIMA forecasts deviate from the compared actual values by about 109.45 units on average, which is roughly 11% of the typical sales level of about 1000.

** MAPE (Mean Absolute Percentage Error): A MAPE of 11.29% expresses the same deviation in relative terms: on average, the ARIMA forecasts differ from the compared actual values by about 11.29%.

Linear Regression Model:

** MSE (Mean Squared Error): The MSE of 0 is a computation artifact, not evidence of a perfect model. The code subtracts the last observed sales value from 30 copies of itself, so every error is exactly zero.

** MAE (Mean Absolute Error): The MAE of 218.977 is computed from the last observed sales value repeated 30 times, not from the predictions of reg_model. It therefore measures how far that single observation lies from the rest of the series, which is large relative to a typical sales level of about 1000.

** MAPE (Mean Absolute Percentage Error): The MAPE of 20.61591% is higher than the 11.29% obtained for ARIMA. A MAPE above 20% is usually considered inaccurate for sales and financial forecasting, but like the MAE it reflects the repeated last value rather than the fitted regression.

General Conclusion

Based on the metrics above, the ARIMA model appears to forecast more accurately than the linear regression model: its MAE (109.45 vs. 218.98) and MAPE (11.29% vs. 20.62%) are both lower.
The in-sample fit points the same way, with a training RMSE of 120.83 for ARIMA against a residual standard error of 134.7 for the regression.
This comparison should be read with caution. The figures reported for the linear regression model come from repeating the last observed value rather than from reg_model, and the MSE of 0 is an artifact of that computation. A fair comparison would evaluate both models on the same held-out observations.
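
One way to put the two models on an equal footing is a holdout split: fit both on the first 70 days and score both on the last 30. The sketch below is an illustration under stated assumptions, not the method used in this report. It re-simulates the data with base R only, fixes the ARIMA order at (1,0,1) as selected above instead of re-running auto.arima, uses `stats::arima()` in place of the forecast package, and introduces the names `sim`, `train`, `test`, and `forecast_metrics` for illustration. Its numbers will not match the output shown earlier.

```r
# Re-simulate the data with base R (same recipe as the Dataset section)
set.seed(123)
n <- 100
t_index <- seq_len(n)
sim <- data.frame(
  Promotional_Spending = runif(n, min = 1000, max = 5000),
  Price = rnorm(n, mean = 50, sd = 10),
  Weather_Conditions = factor(sample(c("sunny", "cloudy", "rainy"), n, replace = TRUE)),
  Sales_Revenue = 1000 + 0.1 * t_index +
    sin(t_index * 2 * pi / 365 * 7) * 100 + rnorm(n, mean = 0, sd = 100)
)

# Chronological split: first 70 days for fitting, last 30 days for testing
train <- sim[1:70, ]
test  <- sim[71:100, ]

# Linear regression fitted on the training period, predicted on the test period
lm_fit  <- lm(Sales_Revenue ~ Promotional_Spending + Price + Weather_Conditions,
              data = train)
lm_pred <- predict(lm_fit, newdata = test)

# ARIMA(1,0,1) with mean, fitted on the training period, forecast 30 steps ahead
arima_fit  <- arima(train$Sales_Revenue, order = c(1, 0, 1), method = "ML")
arima_pred <- as.numeric(predict(arima_fit, n.ahead = nrow(test))$pred)

# Hypothetical helper: MSE, MAE, and MAPE for equal-length vectors
forecast_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(MSE = mean(err^2), MAE = mean(abs(err)), MAPE = mean(abs(err / actual)) * 100)
}

# Both models are scored against the same 30 held-out observations
comparison <- rbind(
  Linear_Regression = forecast_metrics(test$Sales_Revenue, lm_pred),
  ARIMA_1_0_1       = forecast_metrics(test$Sales_Revenue, arima_pred)
)
print(round(comparison, 2))
```

The regression forecast here relies on the test-period values of the predictors, which would have to be known or forecast themselves in a real application.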