| Contact | \(\downarrow\) |
| Email | naftaligunawan@gmail.com |
| Instagram | https://www.instagram.com/nbrigittag/ |
| RPubs | https://rpubs.com/naftalibrigitta/ |
| Name | Naftali Brigitta Gunawan |
| NIM | 20214920002 |
Dataset
The dataset consists of historical sales data collected over the past
two years, including information on sales revenue, promotional
activities, pricing strategies, weather conditions, and holidays.
For this analysis the data are simulated: 100 daily observations from 2022-01-01 to 2022-04-10, with the following variables:
- Date: date of sale
- Promotional_Spending: amount of promotional spending
- Price: product price
- Weather_Conditions: weather condition (sunny, cloudy, or rainy)
- Sales_Revenue: total sales revenue
# Load required libraries
library(tidyverse)
library(lubridate)
# Set seed for reproducibility
set.seed(123)
# Number of observations
n <- 100
# Simulate date range
start_date <- ymd("2022-01-01")
end_date <- ymd("2022-04-10")
dates <- seq(start_date, end_date, by = "day")
# Simulate predictor variables
promotional_spending <- runif(n, min = 1000, max = 5000)
price <- rnorm(n, mean = 50, sd = 10)
weather_conditions <- sample(c("sunny", "cloudy", "rainy"), size = n, replace = TRUE)
# Simulate sales revenue
sales_trend <- 0.1 * seq(1, n)
seasonal_pattern <- sin(seq(1, n) * 2 * pi / 365 * 7) * 100
sales_noise <- rnorm(n, mean = 0, sd = 100)
sales_revenue <- 1000 + sales_trend + seasonal_pattern + sales_noise
# Create dataframe
simulated_data <- tibble(
Date = dates,
Promotional_Spending = promotional_spending,
Price = price,
Weather_Conditions = weather_conditions,
Sales_Revenue = sales_revenue
)
# Display the first few rows of the dataset
simulated_data
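The seasonal term in the simulation is a sine wave whose period follows from its angular frequency, 2 * pi * 7 / 365. The short sketch below, which reuses only the constants from the simulation code, works out that period and how many cycles fit into the 100 simulated days. The names `period_days` and `cycles_in_sample` are introduced here for illustration.

```r
n <- 100
t <- seq_len(n)

# Same seasonal term as in the simulation: amplitude 100,
# angular frequency 2 * pi * 7 / 365 per day
seasonal_pattern <- sin(t * 2 * pi / 365 * 7) * 100

# One full cycle takes 365 / 7 days (illustrative helper names)
period_days <- 365 / 7
cycles_in_sample <- n / period_days

cat("Period (days):", round(period_days, 2), "\n")
cat("Cycles in the sample:", round(cycles_in_sample, 2), "\n")
```

With a period of roughly 52 days, the sample covers fewer than two full cycles, which limits how well any model can learn the seasonal shape from this data.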
Question 1
Develop a regression model to understand the relationship between sales revenue and various predictors such as promotional spending, pricing, and external factors.
library(dplyr)
sum(is.na(simulated_data))
## [1] 0
The dataset contains no missing (NA) values.
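A single total hides where any gaps would be, so a per-column count can be a useful companion to the check above. The following is a minimal sketch on a hypothetical toy data frame (`toy` is not part of the original analysis) with one deliberately missing value.

```r
# Hypothetical toy data frame with one deliberately missing value
toy <- data.frame(
  Price = c(50, NA, 47),
  Weather_Conditions = c("sunny", "rainy", "cloudy")
)

# Total number of missing values across the whole data frame
total_na <- sum(is.na(toy))

# Missing values per column, which shows where the gaps are
na_per_column <- colSums(is.na(toy))

print(total_na)
print(na_per_column)
```

Applied to `simulated_data`, `colSums(is.na(simulated_data))` should return zero for every column, given that the total above is 0.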
# Convert the weather variable to a factor so lm() treats it as categorical
simulated_data <- simulated_data %>%
  mutate(Weather_Conditions = as.factor(Weather_Conditions))
# Fit a multiple linear regression model with lm()
reg_model <- lm(Sales_Revenue ~ Promotional_Spending + Price + Weather_Conditions, data = simulated_data)
summary(reg_model)
##
## Call:
## lm(formula = Sales_Revenue ~ Promotional_Spending + Price + Weather_Conditions,
## data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -340.29 -71.58 11.21 82.72 315.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 992.92772 81.16696 12.233 <2e-16 ***
## Promotional_Spending 0.01922 0.01197 1.605 0.112
## Price -0.64144 1.42687 -0.450 0.654
## Weather_Conditionsrainy -22.74556 36.17645 -0.629 0.531
## Weather_Conditionssunny -23.58060 32.57721 -0.724 0.471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.7 on 95 degrees of freedom
## Multiple R-squared: 0.03842, Adjusted R-squared: -0.002062
## F-statistic: 0.9491 on 4 and 95 DF, p-value: 0.4393
Output Result:
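The p-value of the overall F-test (0.4393) is well above 0.05, so we cannot reject the null hypothesis that all slope coefficients are zero. None of the individual predictors is significant at the 5% level, and the adjusted R-squared is slightly negative (-0.002), so the model explains essentially none of the variation in sales revenue. This is expected here: Sales_Revenue was simulated from a trend, a seasonal term, and random noise, independently of promotional spending, price, and weather.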
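The overall p-value printed at the bottom of `summary()` is not stored directly in the summary object, but it can be recomputed from the stored F statistic. Below is a minimal sketch on hypothetical data (`toy`, `fit`, and `overall_p` are illustrative names) in which the response is unrelated to the predictors, similar to the situation above.

```r
set.seed(1)

# Hypothetical data in which y is unrelated to the predictors
toy <- data.frame(x1 = runif(50), x2 = rnorm(50), y = rnorm(50))
fit <- lm(y ~ x1 + x2, data = toy)
s <- summary(fit)

# s$fstatistic holds the F value, numerator df, and denominator df
f <- s$fstatistic
overall_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

cat("Overall F-test p-value:", overall_p, "\n")
cat("Adjusted R-squared:", s$adj.r.squared, "\n")
```

The same two lines applied to `summary(reg_model)` should reproduce the 0.4393 reported in the output.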
# Load required libraries
library(ggplot2)
library(plotly) # Ensure plotly is loaded
library(gridExtra)
library(cowplot)
# Predict sales revenue using the regression model
simulated_data$Predicted_Sales <- predict(reg_model)
# Create Residual vs Fitted plot using ggplot2
residual_plot <- ggplot(simulated_data, aes(x = Predicted_Sales, y = resid(reg_model))) +
geom_point(color = "green") +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
theme_minimal()
# Convert to plotly
residual_plotly <- ggplotly(residual_plot)
# Create Distribution of Residuals plot using ggplot2
residual_distribution <- ggplot(simulated_data, aes(x = resid(reg_model))) +
geom_histogram(binwidth = 100, fill = "orange", color = "yellow") +
labs(x = "Residuals", y = "Frequency", title = "Distribution of Residuals") +
theme_minimal()
# Convert to plotly
residual_distribution_plotly <- ggplotly(residual_distribution)
# Create QQ Plot using ggplot2
qq_plot <- ggplot(simulated_data, aes(sample = resid(reg_model))) +
stat_qq(color = "red") +
stat_qq_line(color = "blue") +
labs(title = "QQ Plot of Residuals") +
theme_minimal()
# Convert to plotly
qq_plot_plotly <- ggplotly(qq_plot)
# Combine the three interactive plots into one figure with plotly's subplot()
final_plot <- subplot(residual_plotly, residual_distribution_plotly, qq_plot_plotly,
nrows = 2, margin = 0.05)
final_plot
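The residual plots can be complemented by a numeric check. The sketch below, on hypothetical data rather than `reg_model`, runs a Shapiro-Wilk test on the residuals as one possible way to back up the visual impression from a QQ plot. Note that this test is a swapped-in technique and was not part of the original analysis.

```r
set.seed(2)

# Hypothetical data with normally distributed errors
toy <- data.frame(x = runif(80))
toy$y <- 3 + 2 * toy$x + rnorm(80)

fit <- lm(y ~ x, data = toy)
res <- resid(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
cat("Shapiro-Wilk p-value:", sw$p.value, "\n")

# OLS residuals from a model with an intercept average to zero by construction
cat("Mean residual:", mean(res), "\n")
```

A large p-value means the test finds no evidence against normality; it does not prove the residuals are normal.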
Question 2
Build a time series model to capture the temporal patterns and trends in sales revenue, accounting for seasonality and other time-related effects.
# Load required libraries
library(plotly)
# Create a ts object for time series analysis
# Note: with only 100 daily observations, frequency = 365 covers less than one full period
sales_ts <- ts(simulated_data$Sales_Revenue, frequency = 365)
# Create a dataframe from the ts object for easier use with plotly
ts_data <- data.frame(Time = 1:length(sales_ts), Sales_Revenue = as.numeric(sales_ts))
# Create an interactive plotly plot
interactive_ts_plot <- plot_ly(data = ts_data, x = ~Time, y = ~Sales_Revenue, type = 'scatter', mode = 'lines',
line = list(color = 'blue')) %>%
layout(title = "Time Series of Sales Revenue",
xaxis = list(title = "Time"),
yaxis = list(title = "Sales Revenue"))
# Display the plot
interactive_ts_plot
# Load the forecast library for ARIMA modelling
library(forecast)
# Use auto.arima() to select the best-fitting model
arima_model <- auto.arima(sales_ts)
summary(arima_model)
## Series: sales_ts
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
## ar1 ma1 mean
## 0.9198 -0.7377 1001.994
## s.e. 0.0535 0.0803 36.507
##
## sigma^2 = 15050: log likelihood = -621.52
## AIC=1251.04 AICc=1251.47 BIC=1261.47
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -1.929319 120.8257 94.75599 -1.80019 9.983429 NaN -0.0317043
Output Result:
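auto.arima() selected an ARIMA(1,0,1) model with a non-zero mean of about 1002. The AR coefficient (0.92) and the MA coefficient (-0.74) are both large relative to their standard errors. On the training data the model has an RMSE of 120.8, an MAE of 94.8, and a MAPE of about 10%, and the lag-1 autocorrelation of the residuals is close to zero (ACF1 = -0.03), which suggests that little short-term structure is left unexplained. These are in-sample figures, so they do not yet say anything about forecast accuracy. MASE is reported as NaN, most likely because the series was declared with frequency = 365 but has only 100 observations, so the seasonal naive benchmark used to scale MASE cannot be computed.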
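Whether an ARIMA model has captured the temporal structure is usually checked on its residuals. The following is a minimal sketch using only base R (`stats::arima` instead of `forecast::auto.arima`, as an assumption to keep it self-contained) on a hypothetical simulated series; it applies a Ljung-Box test to the residuals of an ARIMA(1,0,1) fit.

```r
set.seed(3)

# Hypothetical series: ARMA(1,1) fluctuations around a level of 1000
y <- 1000 + arima.sim(model = list(ar = 0.9, ma = -0.7), n = 200, sd = 100)

# Fit ARIMA(1,0,1) with a mean term using base R's stats::arima
fit <- arima(y, order = c(1, 0, 1), method = "ML")

# Ljung-Box test on the residuals; fitdf = 2 accounts for the AR and MA terms
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 2)

print(coef(fit))
cat("Ljung-Box p-value:", lb$p.value, "\n")
```

A large Ljung-Box p-value indicates no detectable autocorrelation left in the residuals. With the forecast package loaded, `checkresiduals(arima_model)` offers a similar check for the model fitted above.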
Question 3
Evaluate and compare the performance of both models in forecasting future sales revenue.
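# Generate a 30-step-ahead forecast from the ARIMA model
arima_forecast <- forecast(arima_model, h = 30) # Forecast for the next 30 periods
# Calculate evaluation metrics for the ARIMA model
# Note: the 30 forecasts refer to future periods, while Sales_Revenue holds the 100
# historical values; R recycles the shorter vector, so these are not out-of-sample errors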
arima_mse <- mean((as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue)^2)
arima_mae <- mean(abs(as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue))
arima_mape <- mean(abs((as.numeric(arima_forecast$mean) - simulated_data$Sales_Revenue) / simulated_data$Sales_Revenue)) * 100
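# Calculate evaluation metrics for the linear regression model
# Note: reg_model is not used here; the "prediction" is the last observed value repeated,
# and the MSE line subtracts that value from itself, so it is 0 by construction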
linear_reg_mse <- mean((rep(simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)], 30) - simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)])^2)
linear_reg_mae <- mean(abs(rep(simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)], 30) - simulated_data$Sales_Revenue))
linear_reg_mape <- mean(abs((rep(simulated_data$Sales_Revenue[length(simulated_data$Sales_Revenue)], 30) - simulated_data$Sales_Revenue) / simulated_data$Sales_Revenue)) * 100
# Print evaluation metrics
cat("Evaluation metrics for ARIMA model:\n")
## Evaluation metrics for ARIMA model:
cat("MSE:", arima_mse, "\n")
## MSE: 18802.16
cat("MAE:", arima_mae, "\n")
## MAE: 109.4466
cat("MAPE:", arima_mape, "%\n\n")
## MAPE: 11.29185 %
cat("Evaluation metrics for Linear Regression model:\n")
## Evaluation metrics for Linear Regression model:
cat("MSE:", linear_reg_mse, "\n")
## MSE: 0
cat("MAE:", linear_reg_mae, "\n")
## MAE: 218.977
cat("MAPE:", linear_reg_mape, "%\n")
## MAPE: 20.61591 %
Output Result:
- ARIMA Model:
** MSE (Mean Squared Error): An MSE of 18802.16 corresponds to an RMSE of about 137.1, against sales values of roughly 1000.
** MAE (Mean Absolute Error): An MAE of 109.4466 means the compared values differ by about 109.45 units on average.
** MAPE (Mean Absolute Percentage Error): A MAPE of 11.29% means the compared values differ by about 11.29% on average, which puts the error in relative terms.
** Caveat: These figures were obtained by comparing the 30 future forecasts with the 100 historical observations, with the forecast vector recycled to match. They are therefore not true out-of-sample errors and should be read as a rough indication only.
- Linear Regression Model:
** MSE (Mean Squared Error): The MSE of 0 is a computational artifact, not a sign of a perfect model. The code subtracts the last observed sales value from itself, so the result is 0 by construction.
** MAE (Mean Absolute Error): The MAE of 218.977 is the average absolute distance between the last observed sales value and every observation in the data. The regression model (reg_model) is not used in this calculation, so the figure describes a naive last-value benchmark rather than the regression.
** MAPE (Mean Absolute Percentage Error): The MAPE of 20.61591% has the same limitation. A MAPE above 20% is generally considered inaccurate in sales and financial forecasting, but here it reflects the naive benchmark, not the regression model. The residual standard error of 134.7 from the model summary is a more relevant in-sample indicator for the regression.
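The three metrics discussed above can be sketched as small helper functions. The names `mse`, `mae`, `mape`, and `check_lengths` are illustrative, not from the original code. Each helper refuses vectors of different lengths, which guards against R silently recycling the shorter vector.

```r
# Stop early if the two vectors differ in length (prevents silent recycling)
check_lengths <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
}

mse <- function(actual, predicted) {
  check_lengths(actual, predicted)
  mean((actual - predicted)^2)
}

mae <- function(actual, predicted) {
  check_lengths(actual, predicted)
  mean(abs(actual - predicted))
}

mape <- function(actual, predicted) {
  check_lengths(actual, predicted)
  mean(abs((actual - predicted) / actual)) * 100
}

# Hypothetical values: errors are -10, 20, and 0
actual <- c(100, 200, 400)
predicted <- c(110, 180, 400)

cat("MSE:", mse(actual, predicted), "\n")    # 500 / 3, about 166.67
cat("MAE:", mae(actual, predicted), "\n")    # 10
cat("MAPE:", mape(actual, predicted), "%\n") # 20 / 3, about 6.67
```

Calling any of these with a 30-element forecast and a 100-element series would raise an error instead of returning a misleading number.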
General Conclusion
On the reported MAE and MAPE, the ARIMA model appears to perform better than the benchmark labelled linear regression (MAE 109.45 versus 218.98, MAPE 11.29% versus 20.62%). This comparison should be treated with caution. The MSE of 0 for the linear regression is a computational artifact, and neither set of metrics was computed on a held-out test period. A fair comparison would refit both models on a training window and score their forecasts against the same unseen observations.
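One way to compare the two model types on equal terms is a chronological train/test split. The sketch below is an illustration under stated assumptions, not the original analysis: it uses hypothetical simulated data (`toy`), base R's `stats::arima` with a fixed ARIMA(1,0,1) order instead of `auto.arima`, and a 70/30 split chosen for illustration.

```r
set.seed(123)
n <- 100

# Hypothetical data loosely mimicking the structure of the simulated dataset
toy <- data.frame(
  Promotional_Spending = runif(n, min = 1000, max = 5000),
  Price = rnorm(n, mean = 50, sd = 10)
)
# Sales with autocorrelated noise, unrelated to the predictors
toy$Sales_Revenue <- 1000 + as.numeric(arima.sim(list(ar = 0.6), n = n, sd = 80))

# Chronological split: first 70 rows for training, last 30 for testing
train <- toy[1:70, ]
test <- toy[71:100, ]

# Regression forecast: predict from the held-out predictor values
reg_fit <- lm(Sales_Revenue ~ Promotional_Spending + Price, data = train)
reg_pred <- predict(reg_fit, newdata = test)

# ARIMA(1,0,1) forecast, as many steps ahead as there are test rows
arima_fit <- arima(train$Sales_Revenue, order = c(1, 0, 1), method = "ML")
arima_pred <- as.numeric(predict(arima_fit, n.ahead = nrow(test))$pred)

# Score both forecasts against the same unseen observations
mae <- function(actual, predicted) mean(abs(actual - predicted))
results <- c(
  regression = mae(test$Sales_Revenue, reg_pred),
  arima = mae(test$Sales_Revenue, arima_pred)
)
print(results)
```

Because both forecasts have exactly as many values as the test set, no recycling occurs and the two MAE values are directly comparable.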