Problem 1: Business Risk and Revenue Modeling

Context

You are a data scientist working for a retail chain that models sales, inventory levels, and the impact of pricing and seasonality on revenue. Your task is to analyze various distributions that can describe sales variability and forecast potential revenue.


Part 1: Empirical and Theoretical Analysis of Distributions

Task 1: Generate and Analyze Distributions

  • X ~ Sales: Assume the Sales variable follows a Gamma distribution and estimate its shape and scale parameters.
  • Y ~ Inventory Levels: Assume Inventory Levels follows a Lognormal distribution and estimate its parameters.
  • Z ~ Lead Time: Assume Lead Time Days follows a Normal distribution and estimate its mean and standard deviation.
# Load required packages
library(ggplot2)
library(MASS)

# Load the dataset
file_path <- "C:/Users/Admin/Downloads/synthetic_retail_data.csv"
data <- read.csv(file_path)

# Explore the data structure
str(data)
## 'data.frame':    200 obs. of  6 variables:
##  $ Product_ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sales            : num  158 279 699 1832 460 ...
##  $ Inventory_Levels : num  367 427 408 392 448 ...
##  $ Lead_Time_Days   : num  6.31 5.8 3.07 3.53 10.8 ...
##  $ Price            : num  18.8 26.1 22.4 27.1 18.3 ...
##  $ Seasonality_Index: num  1.184 0.857 0.699 0.698 0.841 ...
# Sales (X) - Gamma Distribution
sales_fit <- try(fitdistr(data$Sales, "gamma"), silent = TRUE)
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
if (inherits(sales_fit, "try-error")) {
  # Alternative distribution fit (e.g., Weibull)
  weibull_fit <- fitdistr(data$Sales, "weibull")
  print(weibull_fit)  # print() shows the estimates and standard errors
  
  # Plot the Weibull distribution fit
  hist(data$Sales, breaks = 30, probability = TRUE, main = "Sales Data with Weibull Fit",
       xlab = "Sales", col = "#9db5cc")
  curve(dweibull(x, shape = weibull_fit$estimate['shape'], 
                 scale = weibull_fit$estimate['scale']), 
        col = "#baa8cb", lwd = 2, add = TRUE)
} else {
  # If the Gamma fit works, display the estimates and plot
  print(sales_fit)  # print() shows the estimates and standard errors
  hist(data$Sales, breaks = 30, probability = TRUE, main = "Sales Data with Gamma Fit",
       xlab = "Sales", col = "#d07eae")
  curve(dgamma(x, shape = sales_fit$estimate['shape'], 
               rate = sales_fit$estimate['rate']), 
        col = "#b7cfb7", lwd = 2, add = TRUE)
}

When fitting a Gamma distribution to the Sales variable, the histogram demonstrates a good alignment between the fitted Gamma distribution and the actual data. The right-skewed nature of the data is well captured by the Gamma distribution, with the density curve closely following the observed frequencies of sales. This provides confidence that the Gamma distribution is a suitable model for the variability in sales. (The "NaNs produced" warning during fitting arises when the optimizer briefly probes invalid parameter values; on its own it does not mean the fit failed.)

From my perspective, this distribution fit helps in understanding the spread and pattern of sales, which is essential for forecasting and risk modeling. By estimating the shape and scale parameters, the model provides insights into the frequency of lower sales values and the decreasing likelihood of higher sales values. However, any deviation between the model and data, especially at the tails, could indicate areas for further exploration or refinement. This analysis reinforces the importance of selecting an appropriate distribution for effective sales modeling.
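
To quantify how well the fitted Gamma tracks the data, especially at the tails, a quantile-quantile comparison is a quick diagnostic. This is a minimal sketch, assuming sales_fit converged to valid shape and rate estimates:

# Q-Q check of the Gamma fit (assumes sales_fit succeeded)
if (!inherits(sales_fit, "try-error")) {
  probs <- ppoints(length(data$Sales))
  theoretical_q <- qgamma(probs,
                          shape = sales_fit$estimate["shape"],
                          rate = sales_fit$estimate["rate"])
  plot(theoretical_q, sort(data$Sales), main = "Gamma Q-Q Plot for Sales",
       xlab = "Theoretical Quantiles", ylab = "Observed Sales")
  abline(0, 1, col = "#d07eae", lwd = 2)  # points near this line indicate a good fit
}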

inventory_fit <- fitdistr(data$Inventory_Levels, "lognormal")
summary(inventory_fit)
##          Length Class  Mode   
## estimate 2      -none- numeric
## sd       2      -none- numeric
## vcov     4      -none- numeric
## n        1      -none- numeric
## loglik   1      -none- numeric
hist(data$Inventory_Levels, breaks = 30, probability = TRUE, main = "Inventory Levels with Lognormal Fit",
     xlab = "Inventory Levels", col = "#baa8cb")
curve(dlnorm(x, meanlog = inventory_fit$estimate["meanlog"], 
             sdlog = inventory_fit$estimate["sdlog"]), 
      col = "#d07eae", lwd = 2, add = TRUE)

When fitting a Lognormal distribution to the Inventory Levels variable, the histogram shows a strong alignment between the fitted Lognormal curve and the actual data. The density curve follows the observed frequencies well, capturing the unimodal shape and the slight right-skew of the data. The Lognormal distribution appears to be a suitable model for the variability in inventory levels, particularly as it accounts for the non-negative nature and asymmetric spread of the data.

From my perspective, this fit provides valuable insights into the distribution of inventory levels across products. By estimating the Lognormal parameters, the model effectively describes the most frequent inventory levels while acknowledging the presence of higher values. This understanding is critical for inventory planning and management, helping to anticipate variability and better allocate resources. Any deviations between the fit and the data may warrant further exploration, but overall, the Lognormal distribution appears to capture the key characteristics of the Inventory Levels variable.
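
A quick sanity check on the Lognormal assumption: if Inventory Levels are lognormally distributed, their logarithm should be approximately normal. A minimal sketch:

# If Inventory_Levels ~ Lognormal, log(Inventory_Levels) should look normal
shapiro.test(log(data$Inventory_Levels))
qqnorm(log(data$Inventory_Levels))
qqline(log(data$Inventory_Levels))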

lead_time_fit <- fitdistr(data$Lead_Time_Days, "normal")
summary(lead_time_fit)
##          Length Class  Mode   
## estimate 2      -none- numeric
## sd       2      -none- numeric
## vcov     4      -none- numeric
## n        1      -none- numeric
## loglik   1      -none- numeric
hist(data$Lead_Time_Days, breaks = 30, probability = TRUE, main = "Lead Time with Normal Fit",
     xlab = "Lead Time Days", col = "#9db5cc")
curve(dnorm(x, mean = lead_time_fit$estimate["mean"], 
            sd = lead_time_fit$estimate["sd"]), 
      col = "#d07eae", lwd = 2, add = TRUE)

When fitting a Normal distribution to the Lead Time Days variable, the histogram demonstrates a strong alignment between the fitted Normal curve and the actual data. The density curve closely follows the observed frequencies, capturing the symmetric and bell-shaped nature of the data. This suggests that the assumption of normality for lead times is valid and provides a good model for understanding the variability in this variable.

From my perspective, the Normal distribution fit is particularly useful for analyzing and predicting lead times, as it allows for straightforward calculation of probabilities and confidence intervals. The fit highlights that most lead times cluster around the mean, with a gradual tapering off toward the extremes. This information can be applied to improve planning and forecasting processes, ensuring more accurate and efficient inventory management. Any slight deviations in the tails should be noted, but overall, this fit adequately captures the key characteristics of Lead Time Days.
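
As an illustration of the "straightforward calculation of probabilities" mentioned above, the fitted Normal parameters yield planning quantities directly. This sketch reuses lead_time_fit; the 10-day threshold is a hypothetical example, not part of the task:

# Planning quantities from the fitted Normal (10-day threshold is hypothetical)
m <- lead_time_fit$estimate["mean"]
s <- lead_time_fit$estimate["sd"]
qnorm(c(0.025, 0.975), mean = m, sd = s)  # central 95% range of lead times
1 - pnorm(10, mean = m, sd = s)           # chance a delivery takes more than 10 days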

Task 2: Calculate Empirical Expected Value and Variance

  • Calculate the empirical mean and variance for Sales, Inventory Levels, and Lead Time.
  • Compare the empirical values with theoretical values derived from distribution parameters.
# Empirical Mean and Variance
empirical_stats <- data.frame(
  Variable = c("Sales", "Inventory Levels", "Lead Time"),
  Mean = c(mean(data$Sales), mean(data$Inventory_Levels), mean(data$Lead_Time_Days)),
  Variance = c(var(data$Sales), var(data$Inventory_Levels), var(data$Lead_Time_Days))
)

empirical_stats
##           Variable       Mean     Variance
## 1            Sales 636.916210 2.148318e+05
## 2 Inventory Levels 488.547176 2.403945e+04
## 3        Lead Time   6.834298 4.361587e+00

When calculating the empirical mean and variance for Sales, Inventory Levels, and Lead Time, I noticed distinct differences in variability among these variables. Sales had the highest mean (~637) and an exceptionally large variance (~214,831), which confirms that sales are highly variable and likely influenced by multiple external factors such as promotions or seasonality. In contrast, Inventory Levels had a smaller mean (~489) and a much lower variance (~24,039), indicating more consistent control over stock levels. Lead Time was the most stable, with a mean of ~6.83 and a very small variance (~4.36), reflecting reliable supplier or restocking performance.

This comparison gave me a clearer understanding of how these variables behave in the dataset. The high variance in sales reaffirms the need for robust demand forecasting to manage unpredictability, while the stability in inventory levels suggests effective stock management practices. Comparing these empirical values to the theoretical values from the fitted distributions also helped me validate the accuracy of the chosen distributions. This task highlighted the importance of understanding variability to make more informed decisions about inventory and operational planning.
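
The empirical-versus-theoretical comparison described above can be made explicit: for a Gamma with shape \(k\) and rate \(r\), the mean is \(k/r\) and the variance \(k/r^2\); for a Lognormal with parameters \(\mu\) and \(\sigma\), the mean is \(e^{\mu + \sigma^2/2}\) and the variance \((e^{\sigma^2}-1)e^{2\mu+\sigma^2}\); the Normal's parameters are its mean and standard deviation directly. A sketch, assuming the three fits above succeeded:

# Theoretical moments implied by the fitted parameters
if (!inherits(sales_fit, "try-error")) {
  k <- sales_fit$estimate[["shape"]]; r <- sales_fit$estimate[["rate"]]
  print(c(gamma_mean = k / r, gamma_var = k / r^2))
}
ml <- inventory_fit$estimate[["meanlog"]]; sl <- inventory_fit$estimate[["sdlog"]]
c(lognormal_mean = exp(ml + sl^2 / 2),
  lognormal_var = (exp(sl^2) - 1) * exp(2 * ml + sl^2))
c(normal_mean = lead_time_fit$estimate[["mean"]],
  normal_var = lead_time_fit$estimate[["sd"]]^2)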


Part 2: Probability Analysis and Independence Testing

Task 1: Empirical Probabilities

  • Calculate the probabilities for the Lead Time Days variable:
    • \(P(Z > \mu - \sigma)\)
    • \(P(Z > \mu + \sigma)\)
    • \(P(Z > \mu + 2\sigma)\)
# Probabilities based on Lead_Time (Z ~ Normal)
mu <- mean(data$Lead_Time_Days)
sigma <- sd(data$Lead_Time_Days)

prob_1 <- 1 - pnorm(mu - sigma, mean = mu, sd = sigma)
prob_2 <- 1 - pnorm(mu + sigma, mean = mu, sd = sigma)
prob_3 <- 1 - pnorm(mu + 2 * sigma, mean = mu, sd = sigma)

probabilities <- data.frame(
  Probability = c("P(Z > mu - sigma)", "P(Z > mu + sigma)", "P(Z > mu + 2sigma)"),
  Value = c(prob_1, prob_2, prob_3)
)

probabilities
##          Probability      Value
## 1  P(Z > mu - sigma) 0.84134475
## 2  P(Z > mu + sigma) 0.15865525
## 3 P(Z > mu + 2sigma) 0.02275013
# Normality test for Lead_Time_Days
shapiro.test(data$Lead_Time_Days)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$Lead_Time_Days
## W = 0.99618, p-value = 0.9026

When analyzing the probabilities for lead times (\(Z\)), the results confirmed the predictability and reliability of this variable. Approximately 84.13% of lead times exceed \(\mu - \sigma\) (equivalently, only about 15.87% fall below it), roughly 15.87% exceed \(\mu + \sigma\), and only 2.28% exceed \(\mu + 2\sigma\), demonstrating that extreme delays are rare. Because these probabilities are evaluated at the fitted mean and standard deviation, they match the textbook tail probabilities of a normal distribution exactly.

The Shapiro-Wilk normality test (\(W = 0.99618\), \(p\)-value = 0.9026) further validated the assumption of normality for lead times, supporting the use of a normal distribution for this analysis. This statistical confirmation reassures me that lead times are highly consistent and exhibit minimal deviation from the average.

Understanding these probabilities is particularly useful for planning inventory replenishment schedules and ensuring timely deliveries. The reliability of lead times reduces operational uncertainty, enabling more accurate stock management and better preparedness for rare delays. Overall, this task reinforced the consistency of lead times as a key operational metric, providing a solid foundation for anticipating and managing potential risks.
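
Since the task asks for empirical probabilities, the model-based values above can also be cross-checked against the raw exceedance frequencies in the data; a minimal sketch:

# Empirical exceedance frequencies, for comparison with the Normal-model values
mean(data$Lead_Time_Days > mu - sigma)      # compare with 0.8413
mean(data$Lead_Time_Days > mu + sigma)      # compare with 0.1587
mean(data$Lead_Time_Days > mu + 2 * sigma)  # compare with 0.0228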

Task 2: Correlation and Independence

  • Investigate the correlation between Sales and Price using a contingency table of quartiles.
  • Use Fisher’s Exact Test and Chi-Square Test to check for independence.
# Correlation between Sales and Price
correlation <- cor(data$Sales, data$Price)
correlation
## [1] 0.1027273

# Create a contingency table for quartiles
data$Sales_Quartile <- cut(data$Sales, quantile(data$Sales, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
data$Price_Quartile <- cut(data$Price, quantile(data$Price, probs = seq(0, 1, 0.25)), include.lowest = TRUE)

contingency_table <- table(data$Sales_Quartile, data$Price_Quartile)
contingency_table
##                 
##                  [5.05,16.6] (16.6,20] (20,22.9] (22.9,29.4]
##   [25.6,284]              11        16        12          11
##   (284,534]               13        10        15          12
##   (534,868]               15        10        13          12
##   (868,2.45e+03]          11        14        10          15
# Fisher's Exact Test with increased workspace
fisher_test <- fisher.test(contingency_table, workspace = 2e7)

# Chi-Square Test
chisq_test <- chisq.test(contingency_table)

list(Fisher_Test = fisher_test, Chi_Square_Test = chisq_test)
## $Fisher_Test
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.8637
## alternative hypothesis: two.sided
## 
## 
## $Chi_Square_Test
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 4.8, df = 9, p-value = 0.8514

In examining the relationship between Sales and Price using contingency tables and statistical tests, I observed that the p-values from both Fisher’s Exact Test (0.8637) and the Chi-Square Test (0.8514) were much greater than 0.05. This indicates that there is no significant association between Sales and Price quartiles in the dataset. The weak relationship aligns with the low correlation coefficient seen earlier, reinforcing the conclusion that pricing does not strongly influence sales behavior within the observed range.

From my perspective, this lack of association suggests that other factors, such as product quality, marketing strategies, or customer preferences, likely play a more significant role in driving sales. Pricing alone might not be an effective lever for increasing sales, and it would be more impactful to explore these alternative factors or consider combined strategies like promotions or bundling. This task emphasized the value of using robust statistical tests to validate initial observations, ensuring that decisions are based on evidence rather than assumptions.
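
Statistical significance aside, an effect-size measure such as Cramér's V summarizes how strong the quartile association actually is. A minimal sketch using the chi-square statistic computed above:

# Cramér's V for the quartile table (values near 0 indicate a negligible association)
n <- sum(contingency_table)
k_min <- min(dim(contingency_table))
sqrt(unname(chisq_test$statistic) / (n * (k_min - 1)))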


Problem 2: Advanced Forecasting and Optimization (Calculus)

Part 1: Descriptive and Inferential Statistics for Inventory Data

Task 1: Inventory Data Analysis

  • Generate univariate descriptive statistics for Inventory Levels and Sales.
  • Create visualizations (histograms and scatterplots).
  • Compute a correlation matrix for Sales, Price, and Inventory Levels.
  • Test hypotheses about correlations with a 95% confidence interval.
# Descriptive Statistics
summary(data[c("Inventory_Levels", "Sales")])
##  Inventory_Levels     Sales        
##  Min.   : 67.35   Min.   :  25.57  
##  1st Qu.:376.51   1st Qu.: 284.42  
##  Median :483.72   Median : 533.54  
##  Mean   :488.55   Mean   : 636.92  
##  3rd Qu.:600.42   3rd Qu.: 867.58  
##  Max.   :858.79   Max.   :2447.49
# Visualizations
ggplot(data, aes(x = Sales)) +
  geom_histogram(binwidth = 100, fill = "#9db5cc", color = "#d07eae") +
  theme_minimal() +
  labs(title = "Histogram of Sales", x = "Sales", y = "Count")

# Simple Scatterplot of Inventory Levels vs Sales
ggplot(data, aes(x = Inventory_Levels, y = Sales)) +
  geom_point(color = "#baa8cb") +
  theme_minimal() +
  labs(title = "Scatterplot of Inventory Levels vs Sales", x = "Inventory Levels", y = "Sales")

# Enhanced Scatterplot with Regression Line and Annotation
ggplot(data, aes(x = Inventory_Levels, y = Sales)) +
  geom_point(color = "#d07eae") +
  geom_smooth(method = "lm", color = "#baa8cb", se = FALSE) +
  labs(title = "Scatterplot with Regression Line", x = "Inventory Levels", y = "Sales") +
  annotate("text", x = 500, y = 1500, label = "Weak correlation", color = "#9db5cc")
## `geom_smooth()` using formula = 'y ~ x'

Analyzing the descriptive statistics and visualizations for Inventory Levels and Sales provided key insights into their relationship. The histogram of Sales shows a skewed distribution, with most values below 1,000 and a few outliers contributing significantly to revenue. The enhanced scatterplot with a regression line further highlights the weak linear relationship between Inventory Levels and Sales. The regression line is nearly flat, reinforcing the low correlation observed earlier. This suggests that factors beyond inventory levels, such as demand fluctuations or seasonality, might play a larger role in driving sales.

Correlation analysis between Sales, Price, and Inventory Levels confirmed weak relationships among these variables, which aligns with the visual findings. This lack of strong correlation suggests the need to explore additional models or factors to understand the dynamics between inventory and sales better. Incorporating variables like demand trends or promotional activity could enhance the predictive power of inventory optimization models. This task emphasized the importance of combining descriptive analytics with more advanced modeling techniques to capture the complex interplay of factors influencing sales.

# Correlation Matrix
cor_matrix <- cor(data[c("Sales", "Price", "Inventory_Levels")])
cor_matrix
##                        Sales       Price Inventory_Levels
## Sales             1.00000000  0.10272730      -0.03529619
## Price             0.10272730  1.00000000      -0.04025941
## Inventory_Levels -0.03529619 -0.04025941       1.00000000

After computing the correlation matrix for Sales, Price, and Inventory Levels, I noticed that the correlations between these variables are very weak. The correlation between Sales and Price is 0.10, suggesting a slight positive relationship, but it’s too weak to be meaningful. Similarly, the correlation between Inventory Levels and Sales is almost negligible at -0.035, indicating no substantial linear relationship. The correlation between Price and Inventory Levels is also weak and negative (-0.04), further emphasizing the lack of strong interactions.

From my perspective, these results highlight the complexity of the dataset. Sales, Price, and Inventory Levels appear to operate independently or are influenced by factors not captured in this analysis, such as customer demand, seasonality, or marketing strategies. These findings suggest that simple linear relationships are insufficient to explain the variability in sales or inventory levels, motivating the need to explore non-linear models or introduce additional variables to better understand these dynamics.
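
One inexpensive probe for the non-linear relationships suggested above is a rank-based correlation, which detects monotone but non-linear association that Pearson's coefficient can miss; a minimal sketch:

# Spearman rank correlations capture monotone, non-linear association
cor(data[c("Sales", "Price", "Inventory_Levels")], method = "spearman")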

# Hypothesis Testing for Correlation
cor_test <- cor.test(data$Sales, data$Price)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  data$Sales and data$Price
## t = 1.4532, df = 198, p-value = 0.1478
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03653442  0.23807516
## sample estimates:
##       cor 
## 0.1027273

After performing a hypothesis test for the correlation between Sales and Price, the p-value of 0.1478 suggests that the correlation is not statistically significant at the 95% confidence level. This result aligns with the earlier observed correlation coefficient of 0.10, which is weak and indicates little to no linear relationship between these variables. The 95% confidence interval for the correlation (-0.0365 to 0.2381) further supports this conclusion, as it includes zero, meaning the true correlation could indeed be negligible.

From my perspective, this confirms that Price has limited direct influence on Sales within this dataset. It’s possible that other factors, such as customer preferences, seasonality, or marketing strategies, have a stronger impact on sales performance. This reinforces the importance of exploring additional variables and potentially more complex models to understand the key drivers of sales. Overall, this test provides clarity and eliminates Price as a dominant factor in influencing sales outcomes.


Part 2: Linear Algebra and Pricing Strategy

Task 1: Price Elasticity of Demand

  • Use linear regression to model the relationship between Sales and Price.
  • Invert the correlation matrix and interpret results.
  • Perform LU decomposition on the correlation matrix.
# Linear Regression
reg_model <- lm(Sales ~ Price, data = data)
summary(reg_model)
## 
## Call:
## lm(formula = Sales ~ Price, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -679.54 -347.85  -98.63  241.12 1770.08 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  442.951    137.419   3.223  0.00148 **
## Price          9.916      6.824   1.453  0.14775   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 462.2 on 198 degrees of freedom
## Multiple R-squared:  0.01055,    Adjusted R-squared:  0.005556 
## F-statistic: 2.112 on 1 and 198 DF,  p-value: 0.1478
# Invert the correlation matrix
precision_matrix <- solve(cor_matrix)
precision_matrix
##                        Sales       Price Inventory_Levels
## Sales             1.01165983 -0.10265390       0.03157495
## Price            -0.10265390  1.01203982       0.03712083
## Inventory_Levels  0.03157495  0.03712083       1.00260894
# LU Decomposition
lu_decomp <- Matrix::lu(cor_matrix)
lu_decomp
## LU factorization of Formal class 'denseLU' [package "Matrix"] with 4 slots
##   ..@ x       : num [1:9] 1 0.1027 -0.0353 0.1027 0.9894 ...
##   ..@ perm    : int [1:3] 1 2 3
##   ..@ Dim     : int [1:2] 3 3
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:3] "Sales" "Price" "Inventory_Levels"
##   .. ..$ : chr [1:3] "Sales" "Price" "Inventory_Levels"

From my analysis of the relationship between Sales and Price, the linear regression model revealed a weak connection. The \(R^2\) value is 0.01055, meaning Price explains just about 1% of the variability in Sales. Additionally, the p-value for the Price coefficient (0.14775) indicates it is not statistically significant at the 95% confidence level. This suggests that changes in Price alone have minimal impact on Sales in this dataset. The intercept, however, was statistically significant, providing a baseline sales value when Price is zero (an extrapolation, since observed prices start at about 5). This weak model performance highlights the need to explore other variables that might better explain sales trends.

When I inverted the correlation matrix to calculate the precision matrix, it provided insights into the dependencies among variables. The diagonal values in the precision matrix are close to 1, confirming that there is minimal multicollinearity among Sales, Price, and Inventory Levels; in fact, the diagonal entries of an inverted correlation matrix are exactly the variables' variance inflation factors. This aligns with the earlier observed weak correlations between these variables.

Finally, the LU decomposition of the correlation matrix broke it into its lower and upper triangular components, confirming the matrix’s structure and numerical stability. While this decomposition is useful for computational purposes, it didn’t reveal any additional relationships or dependencies among the variables. Overall, this task reinforced that Price is not a strong driver of Sales, and further analysis is needed to identify factors that significantly influence sales behavior.
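
To verify that the factorization actually reproduces the original matrix, the triangular factors can be multiplied back together. A sketch, assuming the Matrix package's expand() method for denseLU objects; since the permutation here is the identity (perm = 1 2 3 above), L %*% U alone should recover the correlation matrix:

# Rebuild the correlation matrix from its LU factors (sanity check)
factors <- Matrix::expand(lu_decomp)
max(abs(as.matrix(factors$L %*% factors$U) - cor_matrix))  # should be ~0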


Part 3: Calculus-Based Probability & Statistics for Sales Forecasting

Task 1: Sales Forecasting Using Exponential Distribution

  • Fit an exponential distribution to Sales.
  • Generate samples and compare histograms.
  • Compute the 5th and 95th percentiles and confidence intervals.
# Fit Exponential Distribution
exp_fit <- fitdistr(data$Sales, "exponential")
summary(exp_fit)
##          Length Class  Mode   
## estimate 1      -none- numeric
## sd       1      -none- numeric
## vcov     1      -none- numeric
## n        1      -none- numeric
## loglik   1      -none- numeric
# Generate Samples
set.seed(123)
samples <- rexp(1000, rate = exp_fit$estimate)
ggplot() +
  geom_histogram(aes(samples), binwidth = 50, fill = "#baa8cb", color = "#d07eae") +
  geom_histogram(aes(data$Sales), binwidth = 50, alpha = 0.5, fill = "#9db5cc") +
  theme_minimal() +
  labs(title = "Histogram: Exponential Samples vs Original Sales", x = "Sales", y = "Count")

Fitting an exponential distribution to the Sales data and generating random samples provided interesting insights. The histogram shows a reasonable alignment between the exponential samples (light purple, with pink outlines) and the original Sales data (light blue). Both distributions capture the right-skewed nature of the data, with a higher concentration of values near zero and a gradual decline as values increase. However, the exponential model slightly underestimates the frequency of mid-range sales values (e.g., 500-1000) compared to the actual data.

From my perspective, while the exponential distribution captures the overall shape of the data, it does not fully capture the nuances, particularly for higher sales ranges. This suggests that the exponential distribution is a good starting point for modeling, but it might need refinement or supplementation with a more complex model (e.g., a mixture distribution) to better fit the original Sales data. Overall, this analysis emphasizes the need to validate distribution assumptions by comparing generated samples to the actual data.
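
Rather than judging fit by histogram overlap alone, an information criterion can adjudicate between candidate models; fitdistr objects carry a log-likelihood, so an AIC comparison is direct (a sketch, assuming the earlier Gamma fit succeeded):

# Compare candidate Sales models by AIC (lower is better)
AIC(exp_fit)
if (!inherits(sales_fit, "try-error")) AIC(sales_fit)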

# Compute 5th and 95th percentiles
percentiles <- qexp(c(0.05, 0.95), rate = exp_fit$estimate)
percentiles
## [1]   32.66953 1908.03045
# Compute confidence intervals for Sales
conf_int <- quantile(data$Sales, probs = c(0.05, 0.95))
conf_int
##        5%       95% 
##  104.9028 1502.2498

Calculating the 5th and 95th percentiles for the exponential distribution and actual Sales data revealed important differences. The exponential model’s 5th percentile (~32.67) and 95th percentile (~1908.03) underestimated and overestimated the actual Sales data percentiles (104.90 and 1502.25), respectively. This shows that the exponential distribution captures the general right-skewed trend but overestimates variability compared to the observed data.

From my perspective, these discrepancies highlight the need to refine the model or explore alternatives for better accuracy, especially for forecasting or decision-making tasks. The narrower range in the actual data suggests less extreme variability than the exponential model predicts, underscoring the importance of validating assumptions.
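
The sampling uncertainty in the empirical percentiles themselves can be gauged with a simple bootstrap; a minimal sketch (1,000 resamples is an arbitrary choice):

# Bootstrap 95% confidence intervals for the empirical 5th and 95th percentiles
set.seed(123)
boot_q <- replicate(1000, quantile(sample(data$Sales, replace = TRUE), probs = c(0.05, 0.95)))
apply(boot_q, 1, quantile, probs = c(0.025, 0.975))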

Task 2: Discussion

  • Provide insights on the exponential distribution’s suitability for forecasting.

As shown in Task 1, the exponential distribution reproduces the right-skew of Sales but understates mid-range frequencies (roughly 500-1000) and overstates tail variability, so it serves best as a simple baseline; a Gamma, Weibull, or mixture model would likely track the observed percentiles more closely.

Part 4: Regression Modeling for Inventory Optimization

Task 1: Multiple Regression Model

  • Build a multiple regression model to predict Inventory Levels based on Sales, Lead Time, and Price.
# Multiple Regression Model
multi_reg_model <- lm(Inventory_Levels ~ Sales + Lead_Time_Days + Price, data = data)
summary(multi_reg_model)
## 
## Call:
## lm(formula = Inventory_Levels ~ Sales + Lead_Time_Days + Price, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.54 -118.07   -7.68  111.81  372.56 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    464.792662  61.321852   7.580 1.35e-12 ***
## Sales           -0.007809   0.023955  -0.326    0.745    
## Lead_Time_Days   7.316793   5.293049   1.382    0.168    
## Price           -1.087778   2.305846  -0.472    0.638    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 155.3 on 196 degrees of freedom
## Multiple R-squared:  0.01223,    Adjusted R-squared:  -0.002887 
## F-statistic: 0.8091 on 3 and 196 DF,  p-value: 0.4902
# Check for multicollinearity
library(car)
## Loading required package: carData
vif(multi_reg_model)
##          Sales Lead_Time_Days          Price 
##       1.017562       1.008633       1.011822

Building a multiple regression model to predict Inventory Levels using Sales, Lead Time, and Price provided insights but underscored the limitations of this model. The \(R^2\) value is 0.01223, indicating that only about 1.2% of the variance in Inventory Levels is explained by these predictors. The adjusted \(R^2\), which is slightly negative, further confirms that the model does not effectively capture the relationships in the data. Additionally, the p-values for all predictors are above 0.05, meaning none of them are statistically significant at the 95% confidence level.

The Variance Inflation Factor (VIF) values for Sales, Lead Time, and Price are all close to 1 (ranging from 1.0086 to 1.0176), suggesting that multicollinearity is not a concern in this model. This reassures me that the predictors are not redundant, but their weak individual contributions indicate that they may not be the primary drivers of inventory levels.

From my perspective, these results suggest that Sales, Lead Time, and Price are not strong predictors of Inventory Levels in this dataset. Other factors, such as demand fluctuations, seasonal trends, or supplier constraints, might play a more significant role. The significant intercept highlights a baseline inventory level that exists independently of the included predictors. Moving forward, it would be necessary to explore additional variables or alternative modeling approaches to improve predictive accuracy for inventory management. This task highlights the importance of carefully selecting relevant features to develop more meaningful and reliable models.
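
Since the dataset already includes a Seasonality_Index column, a natural next step is to add it to the model. This is a sketch; whether it improves the fit is an empirical question, not a claim:

# Extend the model with the dataset's Seasonality_Index column
seasonal_model <- lm(Inventory_Levels ~ Sales + Lead_Time_Days + Price + Seasonality_Index,
                     data = data)
summary(seasonal_model)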

Task 2: Optimization

  • Use model coefficients to optimize inventory levels for peak sales seasons.
# Use coefficients to optimize inventory levels
optimized_inventory <- predict(multi_reg_model, newdata = data.frame(
  Sales = c(1000, 1500),
  Lead_Time_Days = c(5, 7),
  Price = c(20, 25)
))

optimized_inventory
##        1        2 
## 471.8124 477.1027

Using the coefficients from the multiple regression model, I predicted inventory levels for peak sales scenarios. The predictions yielded inventory levels of approximately 471.81 units for a sales volume of 1,000 and 477.10 units for a sales volume of 1,500, given the specified lead times (5 and 7 days) and prices (20 and 25). From my perspective, these results offer a baseline for planning inventory during high-demand periods. However, given the weak performance of the regression model (as indicated by the low \(R^2\)), these predictions should be interpreted cautiously. The small difference in predicted inventory levels despite significant changes in sales suggests the model may not fully capture the factors driving inventory needs.

To minimize stockouts, the predicted values could be adjusted upward by incorporating safety stock levels based on historical demand variability. Conversely, to reduce overstock, real-time monitoring of seasonality indices or adjustments based on market trends may be necessary. Additionally, integrating other factors such as supplier constraints or storage capacity could further refine these inventory predictions. This highlights the importance of supplementing model outputs with domain knowledge and contextual adjustments for effective inventory management.
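
The safety-stock adjustment mentioned above can be sketched with the classic z-score formula \(SS = z \cdot \bar{d} \cdot \sigma_L\), where \(\bar{d}\) is average demand per day and \(\sigma_L\) is the standard deviation of lead time. Both the 95% service level and the treatment of Sales as monthly demand are assumptions made purely for illustration:

# Hypothetical safety-stock sketch (assumes Sales is monthly demand, 95% service level)
daily_demand <- mean(data$Sales) / 30
safety_stock <- qnorm(0.95) * daily_demand * sd(data$Lead_Time_Days)
optimized_inventory + safety_stock  # padded inventory targets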

Conclusion

This final project showcased a comprehensive application of statistical modeling, data visualization, and regression analysis to tackle key challenges in retail data analysis. By fitting distributions to key variables such as Sales, Inventory Levels, and Lead Time, I gained insights into their variability and patterns. Probability analysis and independence testing highlighted the weak relationships among variables like Sales and Price, emphasizing the need for additional factors to explain sales behavior effectively.

Regression modeling further revealed the limitations of using a linear approach to predict Inventory Levels, as the low \(R^2\) value indicated that the chosen predictors (Sales, Lead Time, and Price) did not capture enough variance. While the optimization task provided a baseline for inventory planning, it also underscored the importance of integrating additional predictors or using alternative modeling approaches to improve accuracy.

From my perspective, this project highlighted the critical role of exploratory data analysis, statistical validation, and contextual understanding in creating meaningful business insights. Future efforts could focus on incorporating external factors, such as seasonality indices or marketing efforts, to improve the predictive power of the models. Overall, this project reinforced the value of data-driven strategies in solving complex retail challenges and laid a solid foundation for further analysis and improvement.