Task 3 | Regression vs Time Series
Ekonometrika
| Kontak | : \(\downarrow\) |
| mugemisausan05@gmail.com | |
| https://www.instagram.com/saram.05/ | |
| RPubs | https://rpubs.com/sausanramadhani/ |
In this practicum case study, we will analyze sales data from a retail store and develop forecasting models using both regression and time series approaches. The goal is to predict future sales revenue accurately, allowing the store to optimize inventory management and resource allocation.
Dataset
The dataset consists of historical sales data collected over the past two years, including information on :
- Date: Date of Sale
- Promotional_Spending: Amount of Promotional Spending
- Price: Product Price
- Weather_Condition: Weather Condition
- Sales_Revenue : Total Sales Revenue
# Load required libraries
library(tidyverse)
library(lubridate)
# Set seed for reproducibility
set.seed(123)
# Number of observations
n <- 100
# Simulate date range
start_date <- ymd("2022-01-01")
end_date <- ymd("2022-04-10") # Update end date to have 100 days
dates <- seq(start_date, end_date, by = "day")
# Simulate predictor variables
promotional_spending <- runif(n, min = 1000, max = 5000)
price <- rnorm(n, mean = 50, sd = 10)
weather_conditions <- sample(c("sunny", "cloudy", "rainy"), size = n, replace = TRUE)
# Simulate sales revenue
sales_trend <- 0.1 * seq(1, n)
seasonal_pattern <- sin(seq(1, n) * 2 * pi / 365 * 7) * 100
sales_noise <- rnorm(n, mean = 0, sd = 100)
sales_revenue <- 1000 + sales_trend + seasonal_pattern + sales_noise
# Create dataframe
simulated_data <- tibble(
Date = dates,
Promotional_Spending = promotional_spending,
Price = price,
Weather_Conditions = weather_conditions,
Sales_Revenue = sales_revenue
)
# Display the first few rows of the dataset
head(simulated_data)Develop a Regression Model
Develop a regression model to understand the relationship between sales revenue and various predictors such as promotional spending, pricing, and external factors.
# Load required library
library(tidyverse)
# Fit linear regression model
model <- lm(Sales_Revenue ~ Promotional_Spending + Price + Weather_Conditions, data = simulated_data)
# Summarize the model
summary(model)##
## Call:
## lm(formula = Sales_Revenue ~ Promotional_Spending + Price + Weather_Conditions,
## data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -340.29 -71.58 11.21 82.72 315.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 992.92772 81.16696 12.233 <2e-16 ***
## Promotional_Spending 0.01922 0.01197 1.605 0.112
## Price -0.64144 1.42687 -0.450 0.654
## Weather_Conditionsrainy -22.74556 36.17645 -0.629 0.531
## Weather_Conditionssunny -23.58060 32.57721 -0.724 0.471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.7 on 95 degrees of freedom
## Multiple R-squared: 0.03842, Adjusted R-squared: -0.002062
## F-statistic: 0.9491 on 4 and 95 DF, p-value: 0.4393
The regression model using linear model (lm) does not
produce anything significant, in short, it can be seen from the p-value
(0.4393) which is greater than alpha (0.05).
# Load required libraries
library(ggplot2)
library(plotly)
# Predict sales revenue using the model
simulated_data$Predicted_Sales <- predict(model)
# Residual vs Fitted plot
residual_plot <- ggplot(simulated_data, aes(x = Predicted_Sales, y = resid(model))) +
geom_point(color = "#77B0AA") +
geom_hline(yintercept = 0, linetype = "dashed", color = "#8B322C") +
labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
theme_minimal()
# Distribution of Residuals plot
residual_distribution <- ggplot(simulated_data, aes(x = resid(model))) +
geom_histogram(binwidth = 100, fill = "#436850", color = "#12372A") +
labs(x = "Residuals", y = "Frequency", title = "Distribution of Residuals") +
theme_minimal()
# QQ Plot
qq_plot <- ggplot(simulated_data, aes(sample = resid(model))) +
stat_qq(color = "#CAA6A6") +
stat_qq_line(color = "#9B4444") +
labs(title = "QQ Plot of Residuals") +
theme_minimal()
# Convert ggplot objects to plotly objects
residual_plot <- ggplotly(residual_plot)
residual_distribution <- ggplotly(residual_distribution)
qq_plot <- ggplotly(qq_plot)
# Arrange plots
subplot(residual_plot, residual_distribution, qq_plot, nrows = 2)While this model does not provide strong information on how promotional spending, prices, and weather conditions affect sales revenue, it helps us see more clearly the patterns in the data.
- Residual vs Fitted : The data is homogeneous because it is stationary with respect to variance.
- Distribution of Residuals : Normally distributed
- QQ Plot of Residuals : Data is linear
Build a Time Series
Build a time series model to capture the temporal patterns and trends in sales revenue, accounting for seasonality and other time-related effects.
# Load required libraries
library(forecast)
library(plotly)
# Convert Date column to Date format if not already
simulated_data$Date <- as.Date(simulated_data$Date)
# Set Date column as time series
sales_ts <- ts(simulated_data$Sales_Revenue, frequency = 7) # Assuming weekly seasonality
# Fit ARIMA model
arima_model <- auto.arima(sales_ts)
# Forecast data
forecast_data <- forecast(arima_model)
forecast_data## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 15.28571 957.2805 800.0598 1114.501 716.8322 1197.729
## 15.42857 960.8652 801.0584 1120.672 716.4618 1205.269
## 15.57143 964.1626 802.1999 1126.125 716.4620 1211.863
## 15.71429 967.1956 803.4310 1130.960 716.7392 1217.652
## 15.85714 969.9855 804.7116 1135.259 717.2209 1222.750
## 16.00000 972.5516 806.0116 1139.092 717.8506 1227.253
## 16.14286 974.9121 807.3081 1142.516 718.5840 1231.240
## 16.28571 977.0833 808.5845 1145.582 719.3866 1234.780
## 16.42857 979.0804 809.8282 1148.333 720.2314 1237.929
## 16.57143 980.9175 811.0303 1150.805 721.0975 1240.737
## 16.71429 982.6072 812.1847 1153.030 721.9685 1243.246
## 16.85714 984.1615 813.2874 1155.036 722.8321 1245.491
## 17.00000 985.5912 814.3359 1156.846 723.6789 1247.504
## 17.14286 986.9062 815.3291 1158.483 724.5017 1249.311
Above is an ARIMA model prediction, where each row represents one time period.
- Point Forecast : a forecast or estimated value for sales revenue in the time period
- Lo 80 : lower bound of the 80% confidence interval for prediction
- Hi 80 : upper limit of the 80% confidence interval for prediction
- Lo 95 : lower bound of the 95% confidence interval for prediction
- Hi 95 : upper limit of the 95% confidence interval for prediction
# Actual vs. Fitted plot
actual_vs_fitted <- plot_ly(x = simulated_data$Date) %>%
add_lines(y = sales_ts, name = "Actual") %>%
add_lines(y = fitted(arima_model), name = "Fitted") %>%
layout(title = "Actual vs Fitted", xaxis = list(title = "Date"), yaxis = list(title = "Sales Revenue"))
# Residuals plot
residuals_plot <- plot_ly(x = 1:length(residuals(arima_model)), y = residuals(arima_model), type = "scatter", mode = "markers", marker = list(color = "#496989")) %>%
add_trace(x = 1:length(residuals(arima_model)), y = rep(0, length(residuals(arima_model))), mode = "lines", line = list(dash = "dash", color = "#561C24"), name = "Zero Line") %>%
layout(title = "Residuals", xaxis = list(title = "Index"), yaxis = list(title = "Residuals"))
# Arrange plots
subplot(actual_vs_fitted, residuals_plot, nrows = 2)Above are two plots made into one two-line view by using
subplot():
- Plot Aktual vs Yang diprediksi : An interactive plot that compares the actual data with the data predicted by the ARIMA model.
- Plot Residu : An interactive plot that displays the residuals of an ARIMA model. The residual is the difference between the actual value and the value predicted by the model. Based on the plot, it can be seen that the variance is normal.
Evaluate and Compare
Evaluate and compare the performance of both models in forecasting future sales revenue.
# Generate forecasts for both models
arima_forecast <- forecast(arima_model, h = 30) # Forecast for the next 30 periods
# Calculate evaluation metrics for ARIMA model
arima_mse <- mean((as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue)^2)
arima_mae <- mean(abs(as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue))
arima_mape <- mean(abs((as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue) / simulated_data$Sales_Revenue)) * 100
# Calculate evaluation metrics for linear regression model
linear_reg_mse <- mean((rep(model$Sales_Revenue[length(model$Sales_Revenue)], 30) - simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)])^2)
linear_reg_mae <- mean(abs(rep(model$Sales_Revenue[length(model$Sales_Revenue)], 30) - simulated_data$Sales_Revenue))
linear_reg_mape <- mean(abs((rep(model$Sales_Revenue[length(model$Sales_Revenue)], 30) - simulated_data$Sales_Revenue) / simulated_data$Sales_Revenue)) * 100
# Print evaluation metrics
cat("Evaluation metrics for ARIMA model:\n")## Evaluation metrics for ARIMA model:
## MSE: 18802.16
## MAE: 109.4466
## MAPE: 11.29185 %
## Evaluation metrics for Linear Regression model:
## MSE: NaN
## MAE: NaN
## MAPE: NaN %