Task 3 | Regression vs Time Series

Ekonometrika


Kontak : \(\downarrow\)
Email
Instagram https://www.instagram.com/saram.05/
RPubs https://rpubs.com/sausanramadhani/

In this practicum case study, we will analyze sales data from a retail store and develop forecasting models using both regression and time series approaches. The goal is to predict future sales revenue accurately, allowing the store to optimize inventory management and resource allocation.

Dataset

The dataset consists of historical sales data collected over the past two years, including information on :

  • Date: Date of Sale
  • Promotional_Spending: Amount of Promotional Spending
  • Price: Product Price
  • Weather_Condition: Weather Condition
  • Sales_Revenue : Total Sales Revenue
# Load required libraries
library(tidyverse)
library(lubridate)

# Set seed for reproducibility
set.seed(123)

# Number of observations
n <- 100

# Simulate date range
start_date <- ymd("2022-01-01")
end_date <- ymd("2022-04-10")  # Update end date to have 100 days
dates <- seq(start_date, end_date, by = "day")

# Simulate predictor variables
promotional_spending <- runif(n, min = 1000, max = 5000)
price <- rnorm(n, mean = 50, sd = 10)
weather_conditions <- sample(c("sunny", "cloudy", "rainy"), size = n, replace = TRUE)

# Simulate sales revenue
sales_trend <- 0.1 * seq(1, n)
seasonal_pattern <- sin(seq(1, n) * 2 * pi / 365 * 7) * 100
sales_noise <- rnorm(n, mean = 0, sd = 100)
sales_revenue <- 1000 + sales_trend + seasonal_pattern + sales_noise

# Create dataframe
simulated_data <- tibble(
  Date = dates,
  Promotional_Spending = promotional_spending,
  Price = price,
  Weather_Conditions = weather_conditions,
  Sales_Revenue = sales_revenue
)

# Display the first few rows of the dataset
head(simulated_data)

Develop a Regression Model

Develop a regression model to understand the relationship between sales revenue and various predictors such as promotional spending, pricing, and external factors.

# Load required library
library(tidyverse)

# Fit linear regression model
model <- lm(Sales_Revenue ~ Promotional_Spending + Price + Weather_Conditions, data = simulated_data)

# Summarize the model
summary(model)
## 
## Call:
## lm(formula = Sales_Revenue ~ Promotional_Spending + Price + Weather_Conditions, 
##     data = simulated_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -340.29  -71.58   11.21   82.72  315.61 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             992.92772   81.16696  12.233   <2e-16 ***
## Promotional_Spending      0.01922    0.01197   1.605    0.112    
## Price                    -0.64144    1.42687  -0.450    0.654    
## Weather_Conditionsrainy -22.74556   36.17645  -0.629    0.531    
## Weather_Conditionssunny -23.58060   32.57721  -0.724    0.471    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.7 on 95 degrees of freedom
## Multiple R-squared:  0.03842,    Adjusted R-squared:  -0.002062 
## F-statistic: 0.9491 on 4 and 95 DF,  p-value: 0.4393

The regression model using linear model (lm) does not produce anything significant, in short, it can be seen from the p-value (0.4393) which is greater than alpha (0.05).

# Load required libraries
library(ggplot2)
library(plotly)

# Predict sales revenue using the model
simulated_data$Predicted_Sales <- predict(model)

# Residual vs Fitted plot
residual_plot <- ggplot(simulated_data, aes(x = Predicted_Sales, y = resid(model))) +
  geom_point(color = "#77B0AA") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#8B322C") +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
  theme_minimal()

# Distribution of Residuals plot
residual_distribution <- ggplot(simulated_data, aes(x = resid(model))) +
  geom_histogram(binwidth = 100, fill = "#436850", color = "#12372A") +
  labs(x = "Residuals", y = "Frequency", title = "Distribution of Residuals") +
  theme_minimal()

# QQ Plot
qq_plot <- ggplot(simulated_data, aes(sample = resid(model))) +
  stat_qq(color = "#CAA6A6") +
  stat_qq_line(color = "#9B4444") +
  labs(title = "QQ Plot of Residuals") +
  theme_minimal()

# Convert ggplot objects to plotly objects
residual_plot <- ggplotly(residual_plot)
residual_distribution <- ggplotly(residual_distribution)
qq_plot <- ggplotly(qq_plot)

# Arrange plots
subplot(residual_plot, residual_distribution, qq_plot, nrows = 2)

While this model does not provide strong information on how promotional spending, prices, and weather conditions affect sales revenue, it helps us see more clearly the patterns in the data.

  • Residual vs Fitted : The data is homogeneous because it is stationary with respect to variance.
  • Distribution of Residuals : Normally distributed
  • QQ Plot of Residuals : Data is linear

Build a Time Series

Build a time series model to capture the temporal patterns and trends in sales revenue, accounting for seasonality and other time-related effects.

# Load required libraries
library(forecast)
library(plotly)

# Convert Date column to Date format if not already
simulated_data$Date <- as.Date(simulated_data$Date)

# Set Date column as time series
sales_ts <- ts(simulated_data$Sales_Revenue, frequency = 7)  # Assuming weekly seasonality

# Fit ARIMA model
arima_model <- auto.arima(sales_ts)

# Forecast data
forecast_data <- forecast(arima_model)
forecast_data
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 15.28571       957.2805 800.0598 1114.501 716.8322 1197.729
## 15.42857       960.8652 801.0584 1120.672 716.4618 1205.269
## 15.57143       964.1626 802.1999 1126.125 716.4620 1211.863
## 15.71429       967.1956 803.4310 1130.960 716.7392 1217.652
## 15.85714       969.9855 804.7116 1135.259 717.2209 1222.750
## 16.00000       972.5516 806.0116 1139.092 717.8506 1227.253
## 16.14286       974.9121 807.3081 1142.516 718.5840 1231.240
## 16.28571       977.0833 808.5845 1145.582 719.3866 1234.780
## 16.42857       979.0804 809.8282 1148.333 720.2314 1237.929
## 16.57143       980.9175 811.0303 1150.805 721.0975 1240.737
## 16.71429       982.6072 812.1847 1153.030 721.9685 1243.246
## 16.85714       984.1615 813.2874 1155.036 722.8321 1245.491
## 17.00000       985.5912 814.3359 1156.846 723.6789 1247.504
## 17.14286       986.9062 815.3291 1158.483 724.5017 1249.311

Above is an ARIMA model prediction, where each row represents one time period.

  • Point Forecast : a forecast or estimated value for sales revenue in the time period
  • Lo 80 : lower bound of the 80% confidence interval for prediction
  • Hi 80 : upper limit of the 80% confidence interval for prediction
  • Lo 95 : lower bound of the 95% confidence interval for prediction
  • Hi 95 : upper limit of the 95% confidence interval for prediction
# Actual vs. Fitted plot
actual_vs_fitted <- plot_ly(x = simulated_data$Date) %>%
  add_lines(y = sales_ts, name = "Actual") %>%
  add_lines(y = fitted(arima_model), name = "Fitted") %>%
  layout(title = "Actual vs Fitted", xaxis = list(title = "Date"), yaxis = list(title = "Sales Revenue"))

# Residuals plot
residuals_plot <- plot_ly(x = 1:length(residuals(arima_model)), y = residuals(arima_model), type = "scatter", mode = "markers", marker = list(color = "#496989")) %>%
  add_trace(x = 1:length(residuals(arima_model)), y = rep(0, length(residuals(arima_model))), mode = "lines", line = list(dash = "dash", color = "#561C24"), name = "Zero Line") %>%
  layout(title = "Residuals", xaxis = list(title = "Index"), yaxis = list(title = "Residuals"))

# Arrange plots
subplot(actual_vs_fitted, residuals_plot, nrows = 2)

Above are two plots made into one two-line view by using subplot():

  1. Plot Aktual vs Yang diprediksi : An interactive plot that compares the actual data with the data predicted by the ARIMA model.
  2. Plot Residu : An interactive plot that displays the residuals of an ARIMA model. The residual is the difference between the actual value and the value predicted by the model. Based on the plot, it can be seen that the variance is normal.

Evaluate and Compare

Evaluate and compare the performance of both models in forecasting future sales revenue.

# Generate forecasts for both models
arima_forecast <- forecast(arima_model, h = 30)  # Forecast for the next 30 periods

# Calculate evaluation metrics for ARIMA model
arima_mse <- mean((as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue)^2)
arima_mae <- mean(abs(as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue))
arima_mape <- mean(abs((as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue) / simulated_data$Sales_Revenue)) * 100

# Calculate evaluation metrics for linear regression model
linear_reg_mse <- mean((rep(model$Sales_Revenue[length(model$Sales_Revenue)], 30) - simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)])^2)
linear_reg_mae <- mean(abs(rep(model$Sales_Revenue[length(model$Sales_Revenue)], 30) - simulated_data$Sales_Revenue))
linear_reg_mape <- mean(abs((rep(model$Sales_Revenue[length(model$Sales_Revenue)], 30) - simulated_data$Sales_Revenue) / simulated_data$Sales_Revenue)) * 100

# Print evaluation metrics
cat("Evaluation metrics for ARIMA model:\n")
## Evaluation metrics for ARIMA model:
cat("MSE:", arima_mse, "\n")
## MSE: 18802.16
cat("MAE:", arima_mae, "\n")
## MAE: 109.4466
cat("MAPE:", arima_mape, "%\n\n")
## MAPE: 11.29185 %
cat("Evaluation metrics for Linear Regression model:\n")
## Evaluation metrics for Linear Regression model:
cat("MSE:", linear_reg_mse, "\n")
## MSE: NaN
cat("MAE:", linear_reg_mae, "\n")
## MAE: NaN
cat("MAPE:", linear_reg_mape, "%\n")
## MAPE: NaN %