The goal of this analysis is to develop predictive models for estimating Weeks of Supply (WOS) from Average Sales data. Several approaches are compared: simple linear regression, linear regression on transformed variables, and machine learning models such as Random Forest and CART (Classification and Regression Trees). Model diagnostics and evaluation metrics are used to assess the performance of each model.
Loading Libraries and Data
The required libraries, such as tidyverse, readxl, and dlookr, are loaded for data manipulation, visualization, and modeling. The dataset is loaded from an Excel file and cleaned by removing unnecessary columns: Slope and Intercept, text-based columns, and columns whose names start with "..." (possibly temporary columns).
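A minimal sketch of that cleaning step; the exact select() call is not shown in the original, and the column names are taken from the description above:

# Drop helper columns: Slope/Intercept, character columns, and
# temporary columns whose names begin with "..."
data <- data %>%
  select(-any_of(c("Slope", "Intercept")),  # fitted-line helper columns
         -where(is.character),              # text-based columns
         -starts_with("..."))               # temporary "..." columns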
Initial Data Exploration
Basic visualizations and diagnostics are performed (see the sketch after this list):
Correlation Plot: Investigating the correlation between variables.
Normality Check: Testing whether the data are normally distributed using visual tools and statistical tests such as the Shapiro-Wilk test.
Outlier Detection: Identifying and diagnosing potential outliers in the dataset using plot_outlier() and diagnose_outlier().
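A sketch of these checks using base R and dlookr; the calls themselves are not echoed in the output below, but cor.test() and shapiro.test() match the printed results:

plot_correlate(data)                    # correlation plot (dlookr)
cor.test(data$WOS, data$`Avg Sales`)    # Pearson correlation test
shapiro.test(data$WOS)                  # Shapiro-Wilk normality tests
shapiro.test(data$`Avg Sales`)
diagnose_outlier(data)                  # per-column outlier diagnostics (dlookr)
plot_outlier(data)                      # boxplots with and without outliers (dlookr)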
Simple Linear Regression (SLR)
A simple linear regression model is created to understand the relationship between Average Sales and Weeks of Supply (WOS). Key steps, sketched in code after this list, include:
Plotting the relationship between Average Sales and WOS.
Performing a linear regression analysis.
Evaluating model assumptions through diagnostics (Breusch-Pagan test, Goldfeld-Quandt test, etc.).
Checking the model summary and residuals for accuracy.
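The fitting code is not echoed later in the report, but the printed Call confirms the formula; a reconstruction, including the lmtest diagnostics whose results appear further down (lmtest is loaded above):

model <- lm(WOS ~ `Avg Sales`, data = data)  # matches the printed Call
summary(model)

bptest(model)    # studentized Breusch-Pagan test for heteroscedasticity
gqtest(model)    # Goldfeld-Quandt test
hmctest(model)   # Harrison-McCabe test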
Variable Transformation for Improved Fit
Transformation techniques are applied to improve the linearity and normality of the data (reconstructed in code after the list):
First Transformation Model: Log-transforming WOS and applying the inverse of Average Sales to linearize the relationship.
Second Transformation Model: Log-transforming WOS and taking the square root of Average Sales for better results.
Each transformation is followed by Shapiro-Wilk tests for normality, model fitting using linear regression, and residual diagnostics to ensure assumptions are met.
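A reconstruction of the transformed data sets and models, assuming only that the variable names match those appearing in the printed output (lg_wos, inv, sqrAVs):

dat_t <- data %>%
  mutate(lg_wos = log(WOS),          # log of the response
         inv    = 1 / `Avg Sales`)   # inverse of the predictor

dat_t_2 <- data %>%
  mutate(lg_wos = log(WOS),
         sqrAVs = sqrt(`Avg Sales`)) # square root of the predictor

dat_1_model <- lm(lg_wos ~ inv, data = dat_t)      # first transformation model
dat_2_model <- lm(lg_wos ~ sqrAVs, data = dat_t_2) # second transformation model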
Handling Outliers
Outliers are filtered using the interquartile range (IQR) method: data points falling more than 1.5 times the IQR beyond the quartiles are excluded from the analysis. This reduces noise and improves model performance.
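The same fencing rule as a small reusable helper (a sketch; the code later in this report hard-codes the cutoffs it computed):

# Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]
within_iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  x >= q[1] - fence & x <= q[2] + fence
}
# e.g. dat_t %>% filter(within_iqr_fences(lg_wos))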
Random Forest Model
A Random Forest model is trained to predict Weeks of Supply (WOS) from Average Sales. The model's predictive power is evaluated and residual analysis is conducted. Random Forest is a robust ensemble learning method that reduces overfitting and improves accuracy compared to simple linear models.
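The fit itself is not echoed; a reconstruction consistent with the component summary shown later (500 trees is the randomForest default, and mtry = 1 follows from the single predictor):

set.seed(123)  # reproducibility (also set above)
ran <- randomForest(WOS ~ `Avg Sales`, data = data)
summary(ran)   # component summary printed in the Random Forest section below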
CART (Classification and Regression Tree)
A CART model is developed as another alternative for predicting WOS. The tree structure of the model helps capture non-linear relationships between the variables. CART results are visualized with rpart.plot(), and predictions are compared to observed values.
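A reconstruction of the fit and plot, consistent with the Call printed in the CART summary below:

cart <- rpart(WOS ~ `Avg Sales`, data = data)  # matches the printed Call
summary(cart)      # tree summary printed in the CART section below
rpart.plot(cart)   # visualize the fitted tree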
Model Evaluations
The models are evaluated using a custom function that calculates:
SSE (Sum of Squared Errors)
MSE (Mean Squared Error)
RMSE (Root Mean Squared Error)
R-Squared (Coefficient of Determination)
This function is applied to all models, including:
The initial linear regression model.
The transformed models.
The Random Forest and CART models.
These metrics are consolidated to determine which model performs best in predicting Weeks of Supply (WOS).
library(tidyverse)
library(readxl)
library(dlookr)
library(caTools)
library(lmtest)
library(rpart)
library(rpart.plot)
library(randomForest)
set.seed(123)
data <- read_excel("C:/Users/DELL/Desktop/Kolabtree_Projects/Week of supply prediction/Retails prediction dataset..xlsx")
data %>% head(10)

ggplot(data, aes(x = `Avg Sales`, y = WOS)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  ggtitle("Average Sales vs. Weeks of Supply (WOS)") +
  labs(x = "Average Sales", y = "Weeks of Supply (WOS)") +
  theme(plot.title   = element_text(hjust = 0.5, face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        axis.text.x  = element_text(face = "bold"),
        axis.text.y  = element_text(face = "bold"))
##
## Pearson's product-moment correlation
##
## data: data$WOS and data$`Avg Sales`
## t = -4.3774, df = 43, p-value = 7.544e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7298009 -0.3126169
## sample estimates:
## cor
## -0.5552033
##
## Shapiro-Wilk normality test
##
## data: data$WOS
## W = 0.63144, p-value = 2.169e-09
##
## Shapiro-Wilk normality test
##
## data: data$`Avg Sales`
## W = 0.91556, p-value = 0.002999
##
## Call:
## lm(formula = WOS ~ `Avg Sales`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.767 -10.267 -3.142 4.671 80.482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.83030 6.08862 6.706 3.43e-08 ***
## `Avg Sales` -0.31251 0.07139 -4.377 7.54e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.63 on 43 degrees of freedom
## Multiple R-squared: 0.3083, Adjusted R-squared: 0.2922
## F-statistic: 19.16 on 1 and 43 DF, p-value: 7.544e-05
predicted_lm <- predict(model, data)
compare_lm <- data.frame(predicted = predicted_lm,
                         Observed  = data$WOS,
                         residual  = residuals(model))
compare_lm %>% head(10)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 3.1253, df = 1, p-value = 0.07708
##
## Goldfeld-Quandt test
##
## data: model
## GQ = 91.219, df1 = 21, df2 = 20, p-value = 1.667e-15
## alternative hypothesis: variance increases from segment 1 to 2
##
## Harrison-McCabe test
##
## data: model
## HMC = 0.25463, p-value = 0.011
##
## Shapiro-Wilk normality test
##
## data: dat_t$lg_wos
## W = 0.9799, p-value = 0.6159
##
## Shapiro-Wilk normality test
##
## data: dat_t$inv
## W = 0.75462, p-value = 2.74e-07
##
## Call:
## lm(formula = lg_wos ~ inv, data = dat_t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26196 -0.46450 -0.01392 0.58026 1.00151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1901 0.1837 6.479 7.35e-08 ***
## inv 74.9665 10.1494 7.386 3.56e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6001 on 43 degrees of freedom
## Multiple R-squared: 0.5592, Adjusted R-squared: 0.549
## F-statistic: 54.56 on 1 and 43 DF, p-value: 3.565e-09
pp1 <- predict(dat_1_model, newdata = dat_t)
combined_dat_1 <- data.frame(predicted = pp1,
                             Observed  = dat_t$lg_wos,
                             Residual  = residuals(dat_1_model))
combined_dat_1 %>% head(10)
##
## studentized Breusch-Pagan test
##
## data: dat_1_model
## BP = 0.076525, df = 1, p-value = 0.7821
##
## Goldfeld-Quandt test
##
## data: dat_1_model
## GQ = 0.99719, df1 = 21, df2 = 20, p-value = 0.504
## alternative hypothesis: variance increases from segment 1 to 2
##
## Harrison-McCabe test
##
## data: dat_1_model
## HMC = 0.44123, p-value = 0.315
ggplot(data, aes(sqrt(`Avg Sales`), log(WOS))) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Square Root of Average Sales vs. Log of Weeks of Supply (WOS)") +
  labs(x = "Square Root of Average Sales", y = "Log of Weeks of Supply (WOS)") +
  theme(plot.title   = element_text(hjust = 0.5, face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        axis.text.x  = element_text(face = "bold"),
        axis.text.y  = element_text(face = "bold"))
##
## Shapiro-Wilk normality test
##
## data: dat_t_2$lg_wos
## W = 0.9799, p-value = 0.6159
##
## Shapiro-Wilk normality test
##
## data: dat_t_2$sqrAVs
## W = 0.97572, p-value = 0.4579
##
## Call:
## lm(formula = lg_wos ~ sqrAVs, data = dat_t_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2496 -0.5814 0.1145 0.4823 0.9632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.28749 0.43201 12.239 1.33e-15 ***
## sqrAVs -0.33787 0.04895 -6.902 1.78e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6225 on 43 degrees of freedom
## Multiple R-squared: 0.5256, Adjusted R-squared: 0.5146
## F-statistic: 47.64 on 1 and 43 DF, p-value: 1.782e-08
pp <- predict(dat_2_model, newdata = dat_t_2)
combined_dat_2 <- data.frame(predicted = pp,
                             Observed  = dat_t_2$lg_wos,
                             residual  = residuals(dat_2_model))
combined_dat_2 %>% head(10)
##
## studentized Breusch-Pagan test
##
## data: dat_2_model
## BP = 9.3406e-05, df = 1, p-value = 0.9923
##
## Goldfeld-Quandt test
##
## data: dat_2_model
## GQ = 1.3512, df1 = 21, df2 = 20, p-value = 0.2523
## alternative hypothesis: variance increases from segment 1 to 2
##
## Harrison-McCabe test
##
## data: dat_2_model
## HMC = 0.52908, p-value = 0.657
Q1 <- quantile(dat_t$lg_wos, 0.25, na.rm = TRUE)
Q3 <- quantile(dat_t$lg_wos, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
max(dat_t$lg_wos)
## [1] 4.75359
# The cutoff below hard-codes the maximum of lg_wos printed above;
# filtering on the computed upper_bound would be the general form.
filtered_data <- dat_t %>% filter(lg_wos < 4.75359)
Q1 <- quantile(filtered_data$inv, 0.25, na.rm = TRUE)
Q3 <- quantile(filtered_data$inv, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Again a hard-coded cutoff on inv rather than the computed upper_bound;
# renamed from `filter` to avoid masking dplyr::filter
filtered_inv <- filtered_data %>% filter(inv < 0.0272608)
plot_normality(filtered_inv)

ggplot(filtered_data, aes(inv, lg_wos)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Inverse of Average Sales vs. Log of Weeks of Supply (WOS)") +
  labs(x = "Inverse of Average Sales", y = "Log of Weeks of Supply (WOS)") +
  theme(plot.title   = element_text(hjust = 0.5, face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        axis.text.x  = element_text(face = "bold"),
        axis.text.y  = element_text(face = "bold"))
##
## Shapiro-Wilk normality test
##
## data: filtered_data$lg_wos
## W = 0.98167, p-value = 0.7005
##
## Shapiro-Wilk normality test
##
## data: filtered_data$inv
## W = 0.92776, p-value = 0.008733
##
## Call:
## lm(formula = lg_wos ~ inv, data = filtered_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17375 -0.52217 -0.03708 0.47026 1.02112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8649 0.2299 3.762 0.000517 ***
## inv 98.1918 14.3607 6.838 2.48e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.575 on 42 degrees of freedom
## Multiple R-squared: 0.5268, Adjusted R-squared: 0.5155
## F-statistic: 46.75 on 1 and 42 DF, p-value: 2.483e-08
# Modl1 is the linear model refit on the filtered data; per the Call
# printed above: lm(lg_wos ~ inv, data = filtered_data)
p1 <- predict(Modl1, newdata = filtered_data)
y <- data.frame(predicted = p1,
                observed  = filtered_data$lg_wos,
                residual  = residuals(Modl1))
y %>% head(10)
##
## studentized Breusch-Pagan test
##
## data: Modl1
## BP = 2.2097, df = 1, p-value = 0.1371
##
## Goldfeld-Quandt test
##
## data: Modl1
## GQ = 1.0037, df1 = 20, df2 = 20, p-value = 0.4968
## alternative hypothesis: variance increases from segment 1 to 2
##
## Harrison-McCabe test
##
## data: Modl1
## HMC = 0.44756, p-value = 0.293
library(randomForest)   # Random Forest model (already loaded above)
summary(ran)            # component summary of the fitted forest, printed below
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 45 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 45 -none- numeric
## importance 1 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 45 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
pred_rf <- predict(ran, data)
com <- data.frame(predicted = pred_rf,
                  observed  = data$WOS,
                  residual  = data$WOS - pred_rf)
com %>% head(10)
## numeric(0)
## Call:
## rpart(formula = WOS ~ `Avg Sales`, data = data)
## n= 45
##
## CP nsplit rel error xerror xstd
## 1 0.54580769 0 1.0000000 1.0463020 0.5973021
## 2 0.02299183 1 0.4541923 0.5643311 0.3536721
## 3 0.01000000 2 0.4312005 0.5626342 0.3537326
##
## Variable importance
## Avg Sales
## 100
##
## Node number 1: 45 observations, complexity param=0.5458077
## mean=16.48889, MSE=382.2499
## left son=2 (38 obs) right son=3 (7 obs)
## Primary splits:
## Avg Sales < 46.5 to the right, improve=0.5458077, (0 missing)
##
## Node number 2: 38 observations, complexity param=0.02299183
## mean=10.28947, MSE=44.04778
## left son=4 (18 obs) right son=5 (20 obs)
## Primary splits:
## Avg Sales < 81 to the right, improve=0.2362793, (0 missing)
##
## Node number 3: 7 observations
## mean=50.14286, MSE=876.9796
##
## Node number 4: 18 observations
## mean=6.888889, MSE=17.98765
##
## Node number 5: 20 observations
## mean=13.35, MSE=47.7275
pred_cart <- predict(cart, data)
com_cart <- data.frame(predicted = pred_cart,
                       observed  = data$WOS,
                       residual  = data$WOS - pred_cart)
com_cart %>% head(10)

calculate_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  sse <- sum(residuals^2)
  mse <- mean(residuals^2)
  rmse <- sqrt(mse)
  r_squared <- 1 - sse / sum((actual - mean(actual))^2)
  list(SSE = sse, MSE = mse, RMSE = rmse, R_squared = r_squared)
}

# metrics_cart appears in the comparison table below; its computation is
# implied by that table and reconstructed here:
metrics_cart <- calculate_metrics(com_cart$observed, com_cart$predicted)
metrics_rF <- calculate_metrics(com$observed, com$predicted)
metrics_rFiltred <- calculate_metrics(y$observed, y$predicted)
metrics_TM2 <- calculate_metrics(combined_dat_2$Observed, combined_dat_2$predicted)
metrics_TM1 <- calculate_metrics(combined_dat_1$Observed, combined_dat_1$predicted)
metrics_lm <- calculate_metrics(compare_lm$Observed, compare_lm$predicted)
## SSE MSE RMSE R_squared
## metrics_cart 7417.185 164.8263 12.83847 0.5687995
## metrics_rF 1889.738 41.99419 6.480292 0.8901394
## metrics_rFiltred 13.88588 0.3155882 0.5617724 0.5267713
## metrics_TM2 16.66435 0.370319 0.6085384 0.5256108
## metrics_TM1 15.48323 0.3440718 0.5865764 0.5592341
## metrics_lm 11898.95 264.4211 16.26103 0.3082507
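The consolidated table above can be assembled from the metric lists along these lines (a sketch). Note that metrics_TM1, metrics_TM2, and metrics_rFiltred are computed on log(WOS), so their SSE, MSE, and RMSE are not directly comparable with the raw-scale linear, Random Forest, and CART models.

# Bind the named metric lists into one comparison data frame;
# the list names become the row names of the result
metrics_all <- do.call(rbind, lapply(
  list(metrics_cart     = metrics_cart,
       metrics_rF       = metrics_rF,
       metrics_rFiltred = metrics_rFiltred,
       metrics_TM2      = metrics_TM2,
       metrics_TM1      = metrics_TM1,
       metrics_lm       = metrics_lm),
  as.data.frame))
metrics_all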
Through this analysis, we explore various modeling techniques, from simple linear regression to advanced machine learning models, to forecast Weeks of Supply. Each model undergoes rigorous testing and evaluation, including handling outliers and applying transformations for better accuracy. By comparing evaluation metrics across models, the best approach for predicting Weeks of Supply can be identified.
After applying transformations, particularly the logarithmic transformation of WOS and the inverse of Average Sales, the model fit improved, showing better residual distribution and normality. This resulted in a more accurate predictive model. Further enhancements came from the application of machine learning models, where the Random Forest model performed exceptionally well. It captured complex, non-linear relationships that traditional linear models missed, providing superior prediction accuracy.
The Random Forest model, compared to alternatives like CART, consistently delivered the best results in terms of error reduction and predictive reliability, minimizing overfitting while handling outliers more effectively. Overall, the machine learning models, particularly Random Forest, outperformed the linear models, offering a robust solution for predicting Weeks of Supply and improving the accuracy and efficiency of inventory management.
The transformation models, especially the first transformation (log(WOS) against the inverse of Average Sales), offer the most reliable and interpretable predictions for Weeks of Supply (WOS). Random Forest remains the strongest model in terms of raw predictive accuracy but is more complex and harder to interpret. The key challenges involved non-linearity, heteroscedasticity, and outliers, all of which were addressed through transformations and model refinements.