0.0.1 Project Description: Weeks of Supply Projection and Predictive Model Building

The goal of this analysis is to develop predictive models for estimating Weeks of Supply (WOS) from Average Sales. Several approaches are compared, including simple linear regression, regression on transformed variables, and machine learning models such as Random Forest and CART (Classification and Regression Trees), to determine which predicts WOS best. Model diagnostics and evaluation metrics are used to assess the performance of each model.

0.0.2 Steps in the Process

  1. Loading Libraries and Data The required libraries (tidyverse, readxl, dlookr, and others) are loaded for data manipulation, visualization, and modeling. The dataset is loaded from an Excel file and cleaned by removing unnecessary columns: the Slope and Intercept columns, text-based columns, and columns whose names start with "..." (likely temporary columns).

  2. Initial Data Exploration Basic visualizations and diagnostics are performed:

    • Correlation Plot: Investigating the correlation between variables.

    • Normality Check: Testing whether the data is normally distributed using visual tools and statistical tests like Shapiro-Wilk.

    • Outlier Detection: Identifying and diagnosing potential outliers in the dataset using plot_outlier() and diagnose_outlier().

  3. Simple Linear Regression (SLR) A simple linear regression model is created to understand the relationship between Average Sales and Weeks of Supply (WOS). Key steps include:

    • Plotting the relationship between Average Sales and WOS.

    • Performing a linear regression analysis.

    • Evaluating model assumptions through diagnostics (Breusch-Pagan test, Goldfeld-Quandt test, etc.).

    • Checking the model summary and residuals for accuracy.

  4. Variable Transformation for Improved Fit Transformation techniques are applied to improve the linearity and normality of the data:

    • First Transformation Model: Log-transforming WOS and applying the inverse of Average Sales to linearize the relationship.

    • Second Transformation Model: Log-transforming WOS and taking the square root of Average Sales as an alternative linearization.

    • Each transformation is followed by Shapiro-Wilk tests for normality, model fitting using linear regression, and residual diagnostics to ensure assumptions are met.

  5. Handling Outliers Outliers are filtered using the interquartile range (IQR) method: data points lying more than 1.5 times the IQR beyond the first or third quartile are treated as outliers and excluded from the analysis. This reduces noise and improves model performance.

  6. Random Forest Model A Random Forest model is trained to predict Weeks of Supply (WOS) from Average Sales. The model's predictive power is evaluated and residual analysis is conducted. Random Forest is an ensemble method that averages many decision trees, which makes it robust, reduces the overfitting of a single tree, and lets it capture non-linear patterns that a simple linear model misses.

  7. CART (Classification and Regression Tree) A CART model is developed as another alternative to predict WOS. The tree structure of the model helps capture non-linear relationships between the variables. CART model results are visualized using rpart.plot(), and predictions are compared to actual values.

  8. Model Evaluations Several models are evaluated using a custom function that calculates the following metrics (defined formally just after this list):

    • SSE (Sum of Squared Errors)

    • MSE (Mean Squared Error)

    • RMSE (Root Mean Squared Error)

    • R-Squared (Coefficient of Determination)

    This function is applied to all models, including:

    • The initial linear regression model.

    • The transformed models.

    • The Random Forest and CART models.

    These metrics are consolidated to determine which model performs best in predicting Weeks of Supply (WOS).
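
For reference, the metrics in step 8 are defined, for observed values \(y_i\), predictions \(\hat{y}_i\), mean observed value \(\bar{y}\), and sample size \(n\), as

\[
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\mathrm{MSE} = \frac{\mathrm{SSE}}{n}, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
R^2 = 1 - \frac{\mathrm{SSE}}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.
\]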

library(tidyverse) 
library(readxl)
library(dlookr) 
library(caTools)
library(lmtest)
library(rpart) 
library(rpart.plot) 
library(randomForest)
set.seed(123)
data <- read_excel("C:/Users/DELL/Desktop/Kolabtree_Projects/Week of supply prediction/Retails prediction dataset..xlsx")
data %>% head(10)                              # preview the first 10 rows
data <- data %>% select(-c(Slope, Intercept))  # drop the Slope and Intercept columns
data <- data %>% select(-where(is.character))  # drop text-based columns
data <- data %>% select(-starts_with("..."))   # drop unnamed helper columns (names starting with "...")
plot_correlate(data) 

plot_normality(data) 

plot_outlier(data)

diagnose_outlier(data)

1 Simple linear regression

ggplot(data, aes(x = `Avg Sales`, y = WOS)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  ggtitle("Average Sales vs. Weeks of Supply (WOS)") +
  labs(x = "Average Sales", y = "Weeks of Supply (WOS)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        axis.text.x = element_text(face = "bold"),
        axis.text.y = element_text(face = "bold"))

cor.test(data$WOS, data$`Avg Sales`) 
## 
##  Pearson's product-moment correlation
## 
## data:  data$WOS and data$`Avg Sales`
## t = -4.3774, df = 43, p-value = 7.544e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7298009 -0.3126169
## sample estimates:
##        cor 
## -0.5552033
shapiro.test(data$WOS)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$WOS
## W = 0.63144, p-value = 2.169e-09
shapiro.test(data$`Avg Sales`) 
## 
##  Shapiro-Wilk normality test
## 
## data:  data$`Avg Sales`
## W = 0.91556, p-value = 0.002999
model <- lm(WOS~`Avg Sales`, data=data) 
summary(model) 
## 
## Call:
## lm(formula = WOS ~ `Avg Sales`, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.767 -10.267  -3.142   4.671  80.482 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.83030    6.08862   6.706 3.43e-08 ***
## `Avg Sales` -0.31251    0.07139  -4.377 7.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.63 on 43 degrees of freedom
## Multiple R-squared:  0.3083, Adjusted R-squared:  0.2922 
## F-statistic: 19.16 on 1 and 43 DF,  p-value: 7.544e-05
plot(model) 

predite_lm <- predict(model, data)
compare_lm <- data.frame(predicted = predite_lm, Observed = data$WOS, residual = residuals.lm(model, type = "deviance"))
compare_lm %>% head(10)

1.1 Model evaluation

bptest(model) 
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 3.1253, df = 1, p-value = 0.07708
gqtest(model) 
## 
##  Goldfeld-Quandt test
## 
## data:  model
## GQ = 91.219, df1 = 21, df2 = 20, p-value = 1.667e-15
## alternative hypothesis: variance increases from segment 1 to 2
hmctest(model)
## 
##  Harrison-McCabe test
## 
## data:  model
## HMC = 0.25463, p-value = 0.011

2 TRANSFORMATION OF THE VARIABLES

2.1 1. First transformation model

dat_t <- transmute(data, lg_wos = log(WOS), inv = 1/`Avg Sales`)  # log of WOS and inverse of Average Sales
ggplot(dat_t, aes(x = inv, y = lg_wos)) + geom_point(col = "red") + geom_smooth(method = "lm")

diagnose_outlier(dat_t) 
shapiro.test(dat_t$lg_wos)
## 
##  Shapiro-Wilk normality test
## 
## data:  dat_t$lg_wos
## W = 0.9799, p-value = 0.6159
shapiro.test(dat_t$inv)
## 
##  Shapiro-Wilk normality test
## 
## data:  dat_t$inv
## W = 0.75462, p-value = 2.74e-07
dat_1_model <- lm(lg_wos~inv, data=dat_t) 
summary(dat_1_model)
## 
## Call:
## lm(formula = lg_wos ~ inv, data = dat_t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26196 -0.46450 -0.01392  0.58026  1.00151 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.1901     0.1837   6.479 7.35e-08 ***
## inv          74.9665    10.1494   7.386 3.56e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6001 on 43 degrees of freedom
## Multiple R-squared:  0.5592, Adjusted R-squared:  0.549 
## F-statistic: 54.56 on 1 and 43 DF,  p-value: 3.565e-09
plot(dat_1_model)

pp1 <- predict(dat_1_model, newdata = dat_t)
combined_dat_1 <- data.frame(predicted = pp1, Observed = dat_t$lg_wos, Residual = residuals.lm(dat_1_model, type = "deviance"))
combined_dat_1 %>% head(10)

2.2 Model evaluation

bptest(dat_1_model) 
## 
##  studentized Breusch-Pagan test
## 
## data:  dat_1_model
## BP = 0.076525, df = 1, p-value = 0.7821
gqtest(dat_1_model) 
## 
##  Goldfeld-Quandt test
## 
## data:  dat_1_model
## GQ = 0.99719, df1 = 21, df2 = 20, p-value = 0.504
## alternative hypothesis: variance increases from segment 1 to 2
hmctest(dat_1_model)
## 
##  Harrison-McCabe test
## 
## data:  dat_1_model
## HMC = 0.44123, p-value = 0.315

3 2. SECOND TRANSFORMATION MODEL

ggplot(data, aes(x = sqrt(`Avg Sales`), y = log(WOS))) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Square Root of Average Sales vs. Log of Weeks of Supply (WOS)") +
  labs(x = "Square Root of Average Sales", y = "Log of Weeks of Supply (WOS)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        axis.text.x = element_text(face = "bold"),
        axis.text.y = element_text(face = "bold"))

dat_t_2 <- transmute(data, lg_wos = log(WOS), sqrAVs = sqrt(`Avg Sales`))
shapiro.test(dat_t_2$lg_wos)
## 
##  Shapiro-Wilk normality test
## 
## data:  dat_t_2$lg_wos
## W = 0.9799, p-value = 0.6159
shapiro.test(dat_t_2$sqrAVs)
## 
##  Shapiro-Wilk normality test
## 
## data:  dat_t_2$sqrAVs
## W = 0.97572, p-value = 0.4579
dat_2_model <- lm(lg_wos~sqrAVs, data = dat_t_2) 
summary(dat_2_model) 
## 
## Call:
## lm(formula = lg_wos ~ sqrAVs, data = dat_t_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2496 -0.5814  0.1145  0.4823  0.9632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.28749    0.43201  12.239 1.33e-15 ***
## sqrAVs      -0.33787    0.04895  -6.902 1.78e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6225 on 43 degrees of freedom
## Multiple R-squared:  0.5256, Adjusted R-squared:  0.5146 
## F-statistic: 47.64 on 1 and 43 DF,  p-value: 1.782e-08
plot(dat_2_model)

pp <- predict(dat_2_model, newdata = dat_t_2)
combined_dat_2 <- data.frame(predicted = pp, Observed = dat_t_2$lg_wos, Residual = residuals.lm(dat_2_model, type = "deviance"))
combined_dat_2 %>% head(10)

3.1 Model evaluation

bptest(dat_2_model) 
## 
##  studentized Breusch-Pagan test
## 
## data:  dat_2_model
## BP = 9.3406e-05, df = 1, p-value = 0.9923
gqtest(dat_2_model) 
## 
##  Goldfeld-Quandt test
## 
## data:  dat_2_model
## GQ = 1.3512, df1 = 21, df2 = 20, p-value = 0.2523
## alternative hypothesis: variance increases from segment 1 to 2
hmctest(dat_2_model)
## 
##  Harrison-McCabe test
## 
## data:  dat_2_model
## HMC = 0.52908, p-value = 0.657

4 Handling outliers

Q1 <- quantile(dat_t$lg_wos, 0.25, na.rm = TRUE)
Q3 <- quantile(dat_t$lg_wos, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1                  # interquartile range of log(WOS)
lower_bound <- Q1 - 1.5 * IQR   # lower IQR fence
upper_bound <- Q3 + 1.5 * IQR   # upper IQR fence
filtered_data <- dat_t %>% filter(lg_wos < 4.75359)  # literal cutoff equals max(dat_t$lg_wos), shown below, so only the single most extreme observation is dropped
max(dat_t$lg_wos) 
## [1] 4.75359
plot_normality(filtered_data)
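
For reference, the computed fences could also be applied directly instead of a literal cutoff. A minimal sketch of the generic 1.5 × IQR rule, assuming the lower_bound and upper_bound calculated above (the object name filtered_iqr is illustrative), would be:

filtered_iqr <- dat_t %>% filter(lg_wos >= lower_bound, lg_wos <= upper_bound)  # keep rows inside the IQR fences
nrow(dat_t) - nrow(filtered_iqr)                                                # number of observations removed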

5 Filtering out the outliers

Q1 <- quantile(filtered_data$inv, 0.25, na.rm = TRUE)
Q3 <- quantile(filtered_data$inv, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1                  # interquartile range of the inverse of Average Sales
lower_bound <- Q1 - 1.5 * IQR   # lower IQR fence
upper_bound <- Q3 + 1.5 * IQR   # upper IQR fence
filtered_inv <- filtered_data %>% filter(inv < 0.0272608)  # literal cutoff on inv; a name other than `filter` avoids masking dplyr::filter
plot_normality(filtered_inv)    # used only for this normality plot; the model below is still fit on filtered_data

ggplot(filtered_data, aes(x = inv, y = lg_wos)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Inverse of Average Sales vs. Log of Weeks of Supply (WOS)") +
  labs(x = "Inverse of Average Sales", y = "Log of Weeks of Supply (WOS)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        axis.text.x = element_text(face = "bold"),
        axis.text.y = element_text(face = "bold"))

shapiro.test(filtered_data$lg_wos)
## 
##  Shapiro-Wilk normality test
## 
## data:  filtered_data$lg_wos
## W = 0.98167, p-value = 0.7005
shapiro.test(filtered_data$inv)
## 
##  Shapiro-Wilk normality test
## 
## data:  filtered_data$inv
## W = 0.92776, p-value = 0.008733
Modl1 <- lm(lg_wos~inv, filtered_data) 
summary(Modl1) 
## 
## Call:
## lm(formula = lg_wos ~ inv, data = filtered_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17375 -0.52217 -0.03708  0.47026  1.02112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.8649     0.2299   3.762 0.000517 ***
## inv          98.1918    14.3607   6.838 2.48e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.575 on 42 degrees of freedom
## Multiple R-squared:  0.5268, Adjusted R-squared:  0.5155 
## F-statistic: 46.75 on 1 and 42 DF,  p-value: 2.483e-08
p1 <- predict(Modl1, newdata = filtered_data)
y <- data.frame(predicted = p1, observed = filtered_data$lg_wos, residual = residuals.lm(Modl1, type = "deviance"))
y %>% head(10)

5.1 Model evaluation

bptest(Modl1) 
## 
##  studentized Breusch-Pagan test
## 
## data:  Modl1
## BP = 2.2097, df = 1, p-value = 0.1371
gqtest(Modl1) 
## 
##  Goldfeld-Quandt test
## 
## data:  Modl1
## GQ = 1.0037, df1 = 20, df2 = 20, p-value = 0.4968
## alternative hypothesis: variance increases from segment 1 to 2
hmctest(Modl1)
## 
##  Harrison-McCabe test
## 
## data:  Modl1
## HMC = 0.44756, p-value = 0.293

6 Random forest model

library(randomForest) # RANDOMFOREST MODEL

ran <- randomForest(WOS ~ `Avg Sales`, data = data)
summary(ran)
##                 Length Class  Mode     
## call              3    -none- call     
## type              1    -none- character
## predicted        45    -none- numeric  
## mse             500    -none- numeric  
## rsq             500    -none- numeric  
## oob.times        45    -none- numeric  
## importance        1    -none- numeric  
## importanceSD      0    -none- NULL     
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y                45    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL     
## terms             3    terms  call
predit <- predict(ran, data)
com <- data.frame(predicted=predit, observed=data$WOS, residual = data$WOS-predit) 
com%>%head(10) 
plot(ran, main = "Random Forest Model")
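
The Random Forest predictions above are made on the same rows used to fit the model, so the residuals describe in-sample fit only. caTools is loaded at the start of the script but not used in this run; a minimal hold-out sketch with sample.split, assuming an illustrative 70/30 split (the objects ran_holdout and pred_test are hypothetical), would be:

split <- sample.split(data$WOS, SplitRatio = 0.7)             # 70% of rows for training, 30% held out
train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)
ran_holdout <- randomForest(WOS ~ `Avg Sales`, data = train)  # fit only on the training portion
pred_test <- predict(ran_holdout, newdata = test)             # predict on unseen rows
data.frame(predicted = pred_test, observed = test$WOS) %>% head(10)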

7 Classification and Regression Tree

cart <-rpart(WOS~`Avg Sales`, data) 
summary(cart)
## Call:
## rpart(formula = WOS ~ `Avg Sales`, data = data)
##   n= 45 
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.54580769      0 1.0000000 1.0463020 0.5973021
## 2 0.02299183      1 0.4541923 0.5643311 0.3536721
## 3 0.01000000      2 0.4312005 0.5626342 0.3537326
## 
## Variable importance
## Avg Sales 
##       100 
## 
## Node number 1: 45 observations,    complexity param=0.5458077
##   mean=16.48889, MSE=382.2499 
##   left son=2 (38 obs) right son=3 (7 obs)
##   Primary splits:
##       Avg Sales < 46.5 to the right, improve=0.5458077, (0 missing)
## 
## Node number 2: 38 observations,    complexity param=0.02299183
##   mean=10.28947, MSE=44.04778 
##   left son=4 (18 obs) right son=5 (20 obs)
##   Primary splits:
##       Avg Sales < 81   to the right, improve=0.2362793, (0 missing)
## 
## Node number 3: 7 observations
##   mean=50.14286, MSE=876.9796 
## 
## Node number 4: 18 observations
##   mean=6.888889, MSE=17.98765 
## 
## Node number 5: 20 observations
##   mean=13.35, MSE=47.7275
rpart.plot(cart)

predit_cart <- predict(cart, data) 
com_cart <- data.frame(predicted=predit_cart, observed=data$WOS, residual=data$WOS- predit_cart) 
com_cart%>%head(10)
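
The CP table in the summary above reports the cross-validated error (xerror) at each split. If a simpler tree were preferred, a minimal pruning sketch using rpart's cptable and prune() (the objects best_cp and cart_pruned are illustrative) would be:

best_cp <- cart$cptable[which.min(cart$cptable[, "xerror"]), "CP"]  # CP value with the lowest cross-validated error
cart_pruned <- prune(cart, cp = best_cp)                            # prune the tree at that complexity value
rpart.plot(cart_pruned)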

8 MODELS EVALUATIONS

calculate_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  sse <- sum(residuals^2)                                # sum of squared errors
  mse <- mean(residuals^2)                               # mean squared error
  rmse <- sqrt(mse)                                      # root mean squared error
  r_squared <- 1 - sse / sum((actual - mean(actual))^2)  # coefficient of determination
  list(SSE = sse, MSE = mse, RMSE = rmse, R_squared = r_squared)
}
metrics_cart <- calculate_metrics(com_cart$observed, com_cart$predicted)              # CART
metrics_rF <- calculate_metrics(com$observed, com$predicted)                          # Random Forest
metrics_rFiltred <- calculate_metrics(y$observed, y$predicted)                        # outlier-filtered log model (Modl1), log scale
metrics_TM2 <- calculate_metrics(combined_dat_2$Observed, combined_dat_2$predicted)   # second transformation model, log scale
metrics_TM1 <- calculate_metrics(combined_dat_1$Observed, combined_dat_1$predicted)   # first transformation model, log scale
metrics_lm <- calculate_metrics(compare_lm$Observed, compare_lm$predicted)            # simple linear regression
rbind(metrics_cart,metrics_rF,metrics_rFiltred,metrics_TM2,metrics_TM1,metrics_lm)
##                  SSE      MSE       RMSE      R_squared
## metrics_cart     7417.185 164.8263  12.83847  0.5687995
## metrics_rF       1889.738 41.99419  6.480292  0.8901394
## metrics_rFiltred 13.88588 0.3155882 0.5617724 0.5267713
## metrics_TM2      16.66435 0.370319  0.6085384 0.5256108
## metrics_TM1      15.48323 0.3440718 0.5865764 0.5592341
## metrics_lm       11898.95 264.4211  16.26103  0.3082507
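
One caveat when reading this table: the transformed and outlier-filtered models (metrics_TM1, metrics_TM2, metrics_rFiltred) are scored on the log scale, so their SSE, MSE, and RMSE are not directly comparable with the raw-scale rows (metrics_lm, metrics_rF, metrics_cart). A minimal sketch of a same-scale comparison for the first transformation model, assuming a simple exp() back-transformation (which ignores retransformation bias; the objects pred_tm1_wos and metrics_TM1_wos are illustrative), would be:

pred_tm1_wos <- exp(predict(dat_1_model, newdata = dat_t))    # back-transform log(WOS) predictions to the WOS scale
metrics_TM1_wos <- calculate_metrics(data$WOS, pred_tm1_wos)  # score on the same scale as the raw models
metrics_TM1_wos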

8.0.1 Conclusion

Through this analysis, we explore a range of modeling techniques, from simple linear regression to machine learning models, to forecast Weeks of Supply (WOS). Each model is checked against its assumptions, outliers are handled, and variable transformations are applied to improve fit; the evaluation metrics are then compared across models to identify the best approach for predicting WOS.

Applying transformations, particularly the log transformation of WOS combined with the inverse of Average Sales, improved the model fit, yielding better-behaved residuals and normality and therefore more reliable predictions. The machine learning models added further gains in raw accuracy: the Random Forest model achieved the highest R-squared in the comparison table by capturing non-linear relationships that the linear models missed, although this is an in-sample figure, since predictions were made on the same data used to fit the model.

Compared with CART, the Random Forest model delivered the lowest in-sample error among the raw-scale models and handled the non-linearity in the data well. Overall, the machine learning models, particularly Random Forest, outperformed the untransformed linear model in raw predictive accuracy for Weeks of Supply, offering a robust option for supporting inventory management decisions.

The transformation models, especially the first transformation (log of WOS regressed on the inverse of Average Sales), offer the most reliable and interpretable predictions for Weeks of Supply (WOS) projection. Random Forest remains a strong model in terms of raw predictive accuracy but is more complex and harder to interpret. The key challenges were non-linearity, heteroscedasticity, and outliers, all of which were addressed through transformations and model refinements.