Introduction

The following RMD contains CUNY SPS DATA 605 Fall 2025 Final Exam assignment. This report contains two problems with labeled parts to each of them. This analysis examines retail sales, inventory, lead time, and pricing to support data-driven decisions. I fit statistical distributions to key variables and checked correlations and independence between Sales and Price. A multiple regression model was built to predict inventory levels, and prediction intervals are used to guide optimized inventory for peak sales. Overall, the results provide a starting point for inventory planning, sales forecasting, and pricing strategy.

Load Libraries

library(tidyverse)
library(ggplot2)
library(MASS)
library(dplyr)
library(broom)
library(Matrix)
library(car)
library(lmtest)

Problem 1: Business Risk and Revenue Modeling

# Download provided data set
retail_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA605/refs/heads/main/synthetic_retail_data.csv")

# Save the data 
retail_data <- retail_raw

# Take a look
head(retail_data)
## # A tibble: 6 × 6
##   Product_ID Sales Inventory_Levels Lead_Time_Days Price Seasonality_Index
##        <dbl> <dbl>            <dbl>          <dbl> <dbl>             <dbl>
## 1          1  158.             367.           6.31  18.8             1.18 
## 2          2  279.             427.           5.80  26.1             0.857
## 3          3  699.             408.           3.07  22.4             0.699
## 4          4 1832.             392.           3.53  27.1             0.698
## 5          5  460.             448.          10.8   18.3             0.841
## 6          6 1693.             547.          10.1   23.5             1.13
summary(retail_data)
##    Product_ID         Sales         Inventory_Levels Lead_Time_Days  
##  Min.   :  1.00   Min.   :  25.57   Min.   : 67.35   Min.   : 0.491  
##  1st Qu.: 50.75   1st Qu.: 284.42   1st Qu.:376.51   1st Qu.: 5.291  
##  Median :100.50   Median : 533.54   Median :483.72   Median : 6.765  
##  Mean   :100.50   Mean   : 636.92   Mean   :488.55   Mean   : 6.834  
##  3rd Qu.:150.25   3rd Qu.: 867.58   3rd Qu.:600.42   3rd Qu.: 8.212  
##  Max.   :200.00   Max.   :2447.49   Max.   :858.79   Max.   :12.722  
##      Price        Seasonality_Index
##  Min.   : 5.053   Min.   :0.3305   
##  1st Qu.:16.554   1st Qu.:0.8475   
##  Median :19.977   Median :0.9762   
##  Mean   :19.560   Mean   :0.9829   
##  3rd Qu.:22.924   3rd Qu.:1.1205   
##  Max.   :29.404   Max.   :1.5958

Part 1: Empirical and Theoretical Analysis of Distributions

1. Generate and Analyze Distributions:

X ~ Sales

# Sales (X) - Gamma Distribution
sales_fit <- try(fitdistr(retail_data$Sales, "gamma"), silent = TRUE)
summary(sales_fit)
##          Length Class  Mode   
## estimate 2      -none- numeric
## sd       2      -none- numeric
## vcov     4      -none- numeric
## loglik   1      -none- numeric
## n        1      -none- numeric
# Visual check
sales_params <- sales_fit$estimate
qqplot(qgamma(ppoints(retail_data$Sales), 
              sales_params["shape"], sales_params["rate"]),
       retail_data$Sales,
       main = "Q-Q Plot: Sales vs Gamma")
abline(0,1)

The Gamma distribution fits the Sales data well. The theoretical mean and variance from the Gamma parameters are very close to the empirical mean and variance. This suggests the Gamma model is a good choice for describing sales variability. The graph also shows the data is slightly right-skewed and has a long tail..

Y ~ Inventory Levels

# Inventory Levels (Y) - Lognormal Distribution
inventory_levels_fit <- fitdistr(retail_data$Inventory_Levels, "lognormal")
summary(inventory_levels_fit)
##          Length Class  Mode   
## estimate 2      -none- numeric
## sd       2      -none- numeric
## vcov     4      -none- numeric
## n        1      -none- numeric
## loglik   1      -none- numeric
# Visual check
inventory_params <- inventory_levels_fit$estimate
qqplot(qlnorm(ppoints(retail_data$Inventory_Levels), 
              inventory_params["meanlog"], inventory_params["sdlog"]),
       retail_data$Inventory_Levels,
       main = "Q-Q Plot: Inventory Levels vs Lognormal")
abline(0,1)

The Lognormal distribution is reasonable for Inventory Levels because the values are positive and right-skewed. The theoretical mean and variance are slightly higher than the empirical values, but they are still in the same range. This means the Lognormal distribution captures the general shape of the data but might slightly overestimate variability. This makes sense in practice because inventory levels often grow multiplicatively and vary greatly by product.

Z ~ Lead Time

# Lead Time (Z) - Normal Distribution
lead_time_fit <- fitdistr(retail_data$Lead_Time_Days, "normal")
summary(lead_time_fit)
##          Length Class  Mode   
## estimate 2      -none- numeric
## sd       2      -none- numeric
## vcov     4      -none- numeric
## n        1      -none- numeric
## loglik   1      -none- numeric
# Visual check
leadtime_params <- lead_time_fit$estimate
qqplot(qnorm(ppoints(retail_data$Lead_Time_Days), 
              leadtime_params["mean"], leadtime_params["sd"]),
       retail_data$Lead_Time_Days,
       main = "Q-Q Plot: Lead Time vs Normal")
abline(0,1)

The Lead Time data fits the Normal distribution well. The theoretical mean and variance are almost exactly the same as the empirical values. This suggests the Normal distribution is appropriate for modeling lead times. It also implies that lead times are fairly stable and symmetric around their average.

Parameters

##                      Variable   Param1      Param2
## shape           Sales (Gamma) 1.834964 0.002881017
## meanlog Inventory (Lognormal) 6.133037 0.363327270
## mean       Lead Time (Normal) 6.834298 2.083213697

The estimated parameters describe the shape and spread of each distribution.

  • For Sales, the Gamma distribution has a shape of about 1.83 and a rate of about 0.0029, which indicates a right-skewed sales pattern with a long tail. This matches the Q-Q plot’s analysis.
  • For Inventory Levels, the Lognormal distribution has a meanlog of about 6.13 and an sdlog of about 0.36, showing that inventory values grow on a multiplicative scale and have moderate variability.
  • For Lead Time, the Normal distribution has a mean of about 6.83 days and a standard deviation of about 2.08 days, which shows that most lead times cluster near one week with a fairly small amount of variation.

These parameters give a clear summary of how each variable behaves and help support forecasting and planning decisions.

2. Calculate Empirical Expected Value and Variance:

# Theoretical mean and variance
theoretical_stats <- data.frame(
  Variable = c("Sales", "Inventory Levels", "Lead Time"),
  
  Mean = c(
    sales_params["shape"] / sales_params["rate"],
    exp(inventory_params["meanlog"] + (inventory_params["sdlog"]^2)/2),
    leadtime_params["mean"]
  ),
  
  Variance = c(
    sales_params["shape"] / (sales_params["rate"]^2),
    (exp(inventory_params["sdlog"]^2) - 1) * 
      exp(2 * inventory_params["meanlog"] + (inventory_params["sdlog"]^2)),
    leadtime_params["sd"]^2
  )
)

# Empirical mean and variance
empirical_stats <- data.frame(
  Variable = c("Sales", "Inventory Levels", "Lead Time"),
  Mean = c(mean(retail_data$Sales), mean(retail_data$Inventory_Levels), mean(retail_data$Lead_Time_Days)),
  Variance = c(var(retail_data$Sales), var(retail_data$Inventory_Levels), var(retail_data$Lead_Time_Days))
)

# Compare results
comparison_table <- data.frame(
  Variable = c("Sales", "Inventory Levels", "Lead Time"),
  Empirical_Mean = empirical_stats$Mean,
  Empirical_Var = empirical_stats$Variance,
  Theoretical_Mean = theoretical_stats$Mean,
  Theoretical_Var = theoretical_stats$Variance
)

# View results
comparison_table
##           Variable Empirical_Mean Empirical_Var Theoretical_Mean
## 1            Sales     636.916210  2.148318e+05       636.915491
## 2 Inventory Levels     488.547176  2.403945e+04       492.276304
## 3        Lead Time       6.834298  4.361587e+00         6.834298
##   Theoretical_Var
## 1    2.210732e+05
## 2    3.419747e+04
## 3    4.339779e+00

The empirical and theoretical means and variances match closely for all three variables. This shows that the estimated distributions describe the data well. The small differences come from sampling variation and natural noise in retail operations. Overall, the distributional assumptions are reasonable and supported by the data.

Part 2: Probability Analysis and Independence Testing

1. Empirical Probabilities:

Get values

# Get mu and sigma
mu <- mean(retail_data$Lead_Time_Days)
sigma <- sd(retail_data$Lead_Time_Days)

# Base probabilities
p_greater_mu       <- 1 - pnorm(mu, mean = mu, sd = sigma)
p_greater_mu_minus <- 1 - pnorm(mu - sigma, mean = mu, sd = sigma)
p_greater_mu_plus  <- 1 - pnorm(mu + sigma, mean = mu, sd = sigma)
p_greater_mu_plus2 <- 1 - pnorm(mu + 2*sigma, mean = mu, sd = sigma)

P( Z > mu | Z > mu - sigma )

# Get first probability based on given formula
prob_a <- p_greater_mu / p_greater_mu_minus

P( Z > mu + sigma | Z > mu )

# Get second probability based on given formula
prob_b <- p_greater_mu_plus / p_greater_mu

P( Z > mu + 2 * sigma | Z > mu )

# Get third probability based on given formula
prob_c <- p_greater_mu_plus2 / p_greater_mu

All probabilities

##                     Probability Value
## 1  P( Z > mu | Z > mu - sigma ) 0.594
## 2  P( Z > mu + sigma | Z > mu ) 0.317
## 3 P( Z > mu + 2sigma | Z > mu ) 0.046

These results show how likely lead times are to exceed higher thresholds once they are already above certain levels. About 59% of lead times above (mu - sigma) also exceed the mean. If a lead time is above the mean, only 32% go beyond (mu + sigma), and just 5% go beyond (mu + 2sigma). As the cutoff increases, the chance of exceeding it drops quickly.

2. Correlation and Independence:

# Correlation between Sales and Price
correlation <- cor(retail_data$Sales, retail_data$Price)
cat("Correlation between Sales and Price:", round(correlation, 3), "\n")
## Correlation between Sales and Price: 0.103
# Create quartiles to avoid sparse cells
retail_data$Sales_Quartile <- cut(
  retail_data$Sales,
  quantile(retail_data$Sales, probs = seq(0, 1, 0.25)),
  include.lowest = TRUE
)
retail_data$Price_Quartile <- cut(
  retail_data$Price,
  quantile(retail_data$Price, probs = seq(0, 1, 0.25)),
  include.lowest = TRUE
)

# Contingency table
contingency_table <- table(retail_data$Sales_Quartile, retail_data$Price_Quartile)
contingency_table
##                 
##                  [5.05,16.6] (16.6,20] (20,22.9] (22.9,29.4]
##   [25.6,284]              11        16        12          11
##   (284,534]               13        10        15          12
##   (534,868]               15        10        13          12
##   (868,2.45e+03]          11        14        10          15

The correlation between Sales and Price is quite weak.

The contingency table shows how many products fall into each combination of Sales quartile (rows) and Price quartile (columns). Each cell counts the number of products that are in that specific Sales and Price range. The contingency table of quartiles shows that counts, marginal probabilities, and joint probabilities are all spread pretty evenly across categories. This even spread means there may be no meaningful association between the two variables.

# Fisher's Exact Test with increased workspace
fisher_test <- fisher.test(contingency_table, workspace = 2e7)

# Chi-Square Test
chisq_test <- chisq.test(contingency_table)

list(Fisher_Test = fisher_test, Chi_Square_Test = chisq_test)
## $Fisher_Test
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.8637
## alternative hypothesis: two.sided
## 
## 
## $Chi_Square_Test
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 4.8, df = 9, p-value = 0.8514

Both Fisher’s Exact Test and the Chi-Square Test return large p-values, confirming that Sales and Price are statistically independent. Overall, price levels do not appear to influence sales patterns in this dataset.

Problem 2: Advanced Forecasting and Optimization (Calculus) in Retail

# Select only the desired columns for this section
retail_new <- retail_raw |>
  dplyr::select(Inventory_Levels, Sales, Lead_Time_Days, Price)

# View the first few rows
head(retail_new)
## # A tibble: 6 × 4
##   Inventory_Levels Sales Lead_Time_Days Price
##              <dbl> <dbl>          <dbl> <dbl>
## 1             367.  158.           6.31  18.8
## 2             427.  279.           5.80  26.1
## 3             408.  699.           3.07  22.4
## 4             392. 1832.           3.53  27.1
## 5             448.  460.          10.8   18.3
## 6             547. 1693.          10.1   23.5

Part 1: Descriptive and Inferential Statistics for Inventory Data

1. Inventory Data Analysis:

Univariate descriptive statistics for Inventory_Levels and Sales

# Summary statistics for Inventory_Levels and Sales
retail_new |>
  dplyr::select(Inventory_Levels, Sales) |>
  summary()
##  Inventory_Levels     Sales        
##  Min.   : 67.35   Min.   :  25.57  
##  1st Qu.:376.51   1st Qu.: 284.42  
##  Median :483.72   Median : 533.54  
##  Mean   :488.55   Mean   : 636.92  
##  3rd Qu.:600.42   3rd Qu.: 867.58  
##  Max.   :858.79   Max.   :2447.49
# Optional: also compute mean, median, sd explicitly
retail_new |>
  summarise(
    Inventory_Mean = mean(Inventory_Levels),
    Inventory_SD = sd(Inventory_Levels),
    Sales_Mean = mean(Sales),
    Sales_SD = sd(Sales)
  )
## # A tibble: 1 × 4
##   Inventory_Mean Inventory_SD Sales_Mean Sales_SD
##            <dbl>        <dbl>      <dbl>    <dbl>
## 1           489.         155.       637.     463.

Visualizations for Inventory_Levels, Sales, and Price

# Histogram of Inventory_Levels
ggplot(retail_new, aes(x = Inventory_Levels)) +
  geom_histogram(binwidth = 50, fill = "darkorchid4", color = "gray89") +
  labs(title = "Histogram of Inventory Levels", x = "Inventory Levels", y = "Count") +
  theme_minimal()

# Histogram of Sales
ggplot(retail_new, aes(x = Sales)) +
  geom_histogram(binwidth = 200, fill = "salmon", color = "gray89") +
  labs(title = "Histogram of Sales", x = "Sales", y = "Count") +
  theme_minimal()

# Histogram of Price
ggplot(retail_new, aes(x = Price)) +
  geom_histogram(binwidth = 2, fill = "lightskyblue2", color = "gray89") +
  labs(title = "Histogram of Price", x = "Price", y = "Count") +
  theme_minimal()

# Scatterplot of Sales vs Inventory_Levels
ggplot(retail_new, aes(x = Inventory_Levels, y = Sales)) +
  geom_point(color = "plum3", alpha=0.7) +
  geom_smooth(method = "lm", color = "darkred", se = TRUE) +
  labs(title = "Sales vs Inventory Levels", x = "Inventory Levels", y = "Sales") +
  theme_minimal()

# Scatterplot of Sales vs Price
ggplot(retail_new, aes(x = Price, y = Sales)) +
  geom_point(color = "aquamarine3", alpha=0.7) +
  geom_smooth(method = "lm", color = "darkred", se = TRUE) +
  labs(title = "Sales vs Price", x = "Price", y = "Sales") +
  theme_minimal()

# Scatterplot of Inventory_Levels vs Price
ggplot(retail_new, aes(x = Price, y = Inventory_Levels)) +
  geom_point(color = "purple3", alpha=0.7) +
  geom_smooth(method = "lm", color = "darkred", se = TRUE) +
  labs(title = "Inventory Levels vs Price", x = "Price", y = "Inventory Levels") +
  theme_minimal()

Histograms:

  • Sales is right-skewed

  • Inventory_Level is right-skewed

  • Price is roughly symmetric

The scatterplots with regression lines confirm the correlation results: there is no strong linear trend between Sales and Inventory_Levels, Sales and Price, or Inventory_Levels and Price. Overall, the weak correlations and mostly flat regression lines suggest that variations in Price or Inventory_Levels have little direct impact on Sales in this dataset.

These patterns support the conclusion that multicollinearity is not a concern, and inventory decisions should consider other factors like seasonality or marketing rather than relying solely on Price or stock levels.

Correlation matrix for Sales, Price, and Inventory_Levels

# Select numeric columns
numeric_vars <- retail_new |> 
  dplyr::select(Sales, Price, Inventory_Levels)

# Compute correlation matrix
cor_matrix <- cor(retail_new[, c("Sales", "Price", "Inventory_Levels")])
cor_matrix
##                        Sales       Price Inventory_Levels
## Sales             1.00000000  0.10272730      -0.03529619
## Price             0.10272730  1.00000000      -0.04025941
## Inventory_Levels -0.03529619 -0.04025941       1.00000000

The correlation analysis shows that Sales and Price have a very weak positive correlation (r = 0.103, p = 0.148), while Sales and Inventory_Levels (r = -0.035, p = 0.620) and Price and Inventory_Levels (r = -0.040, p = 0.571) are both essentially uncorrelated.

Test the hypotheses that the correlations between the variables are zero and provide a 95% confidence interval.

# Function to compute correlation test and 95% CI
cor_test_with_ci <- function(x, y) {
  test <- cor.test(x, y)
  data.frame(
    Variable1 = deparse(substitute(x)),
    Variable2 = deparse(substitute(y)),
    Correlation = test$estimate,
    p_value = test$p.value,
    CI_lower = test$conf.int[1],
    CI_upper = test$conf.int[2]
  )
}

# Apply to all pairs
test_results <- rbind(
  cor_test_with_ci(retail_new$Sales, retail_new$Price),
  cor_test_with_ci(retail_new$Sales, retail_new$Inventory_Levels),
  cor_test_with_ci(retail_new$Price, retail_new$Inventory_Levels)
)

test_results
##             Variable1                   Variable2 Correlation   p_value
## cor  retail_new$Sales            retail_new$Price  0.10272730 0.1477542
## cor1 retail_new$Sales retail_new$Inventory_Levels -0.03529619 0.6197608
## cor2 retail_new$Price retail_new$Inventory_Levels -0.04025941 0.5713837
##         CI_lower   CI_upper
## cor  -0.03653442 0.23807516
## cor1 -0.17318909 0.10395389
## cor2 -0.17800614 0.09903478

The 95% confidence intervals for all correlations include zero, confirming that these relationships are not statistically significant. These results suggest that changes in Price or Inventory_Levels are unlikely to strongly impact Sales in this dataset.

2. Discussion:

From an inventory management perspective, there is no evidence of a strong link between inventory levels and sales, so decisions about stocking may need to rely on other factors such as seasonality or demand forecasts, which are not included here. Additionally, because all predictor correlations are very low, multicollinearity would not be a concern in a potential regression model.

Part 2: Linear Algebra and Pricing Strategy

1. Price Elasticity of Demand:

# Linear regression
reg_model <- lm(Sales ~ Price, data = retail_new)
reg_tidy <- tidy(reg_model)
reg_tidy
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   443.      137.        3.22 0.00148
## 2 Price           9.92      6.82      1.45 0.148
# Correlation matrix 
cor_matrix |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  pivot_longer(-Variable, names_to = "Compare", values_to = "Correlation")
## # A tibble: 9 × 3
##   Variable         Compare          Correlation
##   <chr>            <chr>                  <dbl>
## 1 Sales            Sales                 1     
## 2 Sales            Price                 0.103 
## 3 Sales            Inventory_Levels     -0.0353
## 4 Price            Sales                 0.103 
## 5 Price            Price                 1     
## 6 Price            Inventory_Levels     -0.0403
## 7 Inventory_Levels Sales                -0.0353
## 8 Inventory_Levels Price                -0.0403
## 9 Inventory_Levels Inventory_Levels      1
# Precision matrix (inverse of correlation)
precision_matrix <- solve(cor_matrix)
precision_tidy <- as.data.frame(precision_matrix) |>
  rownames_to_column("Variable") |>
  pivot_longer(-Variable, names_to = "Compare", values_to = "Precision")
precision_tidy
## # A tibble: 9 × 3
##   Variable         Compare          Precision
##   <chr>            <chr>                <dbl>
## 1 Sales            Sales               1.01  
## 2 Sales            Price              -0.103 
## 3 Sales            Inventory_Levels    0.0316
## 4 Price            Sales              -0.103 
## 5 Price            Price               1.01  
## 6 Price            Inventory_Levels    0.0371
## 7 Inventory_Levels Sales               0.0316
## 8 Inventory_Levels Price               0.0371
## 9 Inventory_Levels Inventory_Levels    1.00
# LU decomposition of correlation matrix
lu_decomp <- lu(cor_matrix)

# Extract L and U as base R matrices
L <- as.matrix(expand(lu_decomp)$L)
U <- as.matrix(expand(lu_decomp)$U)

# Convert to tidy data frames
L_tidy <- L |>
  as.data.frame() |>
  tibble::rownames_to_column("Variable") |>
  pivot_longer(-Variable, names_to = "Compare", values_to = "L_Value")
U_tidy <- U |>
  as.data.frame() |>
  tibble::rownames_to_column("Variable") |>
  pivot_longer(-Variable, names_to = "Compare", values_to = "U_Value")

# Correlation tests with tidy output
cor_tests <- list(
  Sales_Price = cor.test(retail_new$Sales, retail_new$Price),
  Sales_Inventory = cor.test(retail_new$Sales, retail_new$Inventory_Levels),
  Price_Inventory = cor.test(retail_new$Price, retail_new$Inventory_Levels)
)

cor_tests_tidy <- map_dfr(cor_tests, ~{
  tibble(
    Estimate = .$estimate,
    t_value  = .$statistic,
    df       = .$parameter,
    p_value  = .$p.value,
    CI_lower = .$conf.int[1],
    CI_upper = .$conf.int[2]
  )
}, .id = "Comparison")

cor_tests_tidy
## # A tibble: 3 × 7
##   Comparison      Estimate t_value    df p_value CI_lower CI_upper
##   <chr>              <dbl>   <dbl> <int>   <dbl>    <dbl>    <dbl>
## 1 Sales_Price       0.103    1.45    198   0.148  -0.0365   0.238 
## 2 Sales_Inventory  -0.0353  -0.497   198   0.620  -0.173    0.104 
## 3 Price_Inventory  -0.0403  -0.567   198   0.571  -0.178    0.0990

Here, the relationship between Sales, Price, and Inventory Levels is analyzed using regression, correlation, and matrix decomposition methods. The linear regression of Sales on Price shows a very weak relationship (R-squared = 0.0106, p = 0.148), meaning that Price explains almost none of the variation in Sales. This is unexpected, as price and sales are often directly linked. The diagonal elements of the precision matrix are close to 1, suggesting minimal multicollinearity among Sales, Price, and Inventory_Levels. LU decomposition confirms the correlation matrix is numerically stable and well-conditioned, but reveals no additional dependencies. Overall, Price has limited influence on Sales, so other factors should be considered to explain sales patterns.

In this dataset, the weak regression coefficient and low R-squared suggest that Sales are relatively inelastic with respect to Price, meaning small changes in Price have minimal effect on demand. Pricing alone won’t move sales. Other strategies are needed to drive the revenue.

Part 3: Calculus-Based Probability & Statistics for Sales Forecasting

1. Sales Forecasting Using Exponential Distribution:

# Fit exponential to Sales
exp_fit <- fitdistr(retail_new$Sales, "exponential")
rate_hat <- as.numeric(exp_fit$estimate["rate"])
exp_fit
##        rate    
##   0.0015700652 
##  (0.0001110204)
# Generate 1,000 samples from fitted exponential
set.seed(64)
samples <- rexp(1000, rate = rate_hat)

# Prep data frames for plotting
df_orig <- tibble(Sales = retail_new$Sales, Type = "Original")
df_sim  <- tibble(Sales = samples, Type = "Exponential Sim")
df_both <- bind_rows(df_orig, df_sim)

# Plot an overlaid histogram
ggplot(df_both, aes(x = Sales, fill = Type)) +
  geom_histogram(position = "identity", alpha = 0.55, bins = 60, color = "gray70") +
  scale_fill_manual(values = c("Original" = "aquamarine",
                               "Exponential Sim" = "salmon")) +
  labs(title = "Original Sales vs. Simulated Exponential Samples",
       x = "Sales", y = "Count") +
  theme_minimal()

# Theoretical percentiles from exponential CDF 
theor_pct <- qexp(c(0.05, 0.95), rate = rate_hat) |> set_names(c("5%", "95%"))

# Empirical percentiles
emp_pct <- quantile(retail_new$Sales, probs = c(0.05, 0.95)) |> set_names(c("5%", "95%"))

# 95% CI for the mean assuming normality
n <- length(retail_new$Sales)
xbar <- mean(retail_new$Sales)
s <- sd(retail_new$Sales)
tcrit <- qt(0.975, df = n - 1)
ci_mean <- c(xbar - tcrit * s / sqrt(n), xbar + tcrit * s / sqrt(n)) |> 
  set_names(c("CI_lower", "CI_upper"))

# Print tidy results
list(
  Fit = exp_fit,
  Theoretical_percentiles = theor_pct,
  Empirical_percentiles = emp_pct,
  Mean_and_95CI = c(mean = xbar, ci_mean)
)
## $Fit
##        rate    
##   0.0015700652 
##  (0.0001110204)
## 
## $Theoretical_percentiles
##         5%        95% 
##   32.66953 1908.03045 
## 
## $Empirical_percentiles
##        5%       95% 
##  104.9028 1502.2498 
## 
## $Mean_and_95CI
##     mean CI_lower CI_upper 
## 636.9162 572.2866 701.5458

I fit an exponential distribution to the right-skewed Sales data, simulated 1,000 samples from the fitted exponential, and plotted those samples together with the original Sales histogram for visual comparison. I computed the 5th and 95th percentiles from the fitted exponential and compared them to the empirical 5th/95th percentiles from the data. I also calculated a 95% confidence interval for the mean (assuming normality) to compare with the empirical percentiles. The exponential fit produced a rate of 0.00157, and the simulated data show a typical exponential shape, strong right skew, and many small values. The theoretical percentiles (5% = 32.7, 95% = 1908.0) differ noticeably from the empirical percentiles of the real data (5% = 104.9, 95% = 1502.2). This mismatch suggests the exponential model does not capture the true spread or tail behavior of the sales data very well. The normal-based 95% confidence interval for the mean (572.3 to 701.5) lands nicely around the actual mean (636.9), but normality assumptions do not hold for strongly skewed data. This CI is less meaningful for forecasting actual sales values.

2. Discussion:

The exponential distribution does not closely model the data. The comparison histogram shows the difference between the original Sales and the simulated exponential data distributions. The simulated exponential tends to put too much mass near zero and declines too quickly compared to the real Sales data, which has a heavier right tail. The simulated exponential percentiles (33 and 1908) were much wider and more extreme than the empirical percentiles (105 and 1502), showing that the exponential distribution overestimates the lower tail and over-stretches the upper tail. This mismatch occurs because the data are right-skewed but not nearly as sharply as an exponential distribution assumes. For forecasting, this means an exponential model would generate too many very small values and too many extreme large values, making it unreliable for predicting actual future sales. My suggestion would be to use a log-normal or gamma distribution to try and provide a better fit for this type of skewed data.

## # A tibble: 3 × 6
##   Dist         Param1   Param2   AIC KS_stat KS_pvalue
##   <chr>         <dbl>    <dbl> <dbl>   <dbl>     <dbl>
## 1 Exponential 0.00157 NA       2985.  0.140   0.000783
## 2 Lognormal   6.16     0.846   2969.  0.0711  0.264   
## 3 Gamma       1.83     0.00288 2950.  0.0428  0.857

I fit Exponential, Lognormal, and Gamma distributions to the Sales data using maximum likelihood. I compared fits by plotting the empirical histogram with the three fitted density curves and by computing AIC and KS statistics. I also calculated the 5th and 95th percentiles for the Lognormal and Gamma fits for direct comparison with the empirical percentiles. The Exponential fails to match the data, but both the Lognormal and Gamma fit much better. AIC and KS tests show the Gamma distribution is the best overall fit, capturing both the main body and the right tail most accurately. Its percentiles are closest to the empirical values, meaning it reflects the actual sales range more realistically. For forecasting, the Gamma is the most reliable choice, with the Lognormal as a reasonable backup.

Part 4: Regression Modeling for Inventory Optimization

1. Multiple Regression Model:

# Multiple Regression Model
multi_reg_model <- lm(Inventory_Levels ~ Sales + Lead_Time_Days + Price, data = retail_new)

# Tidy summary + model-level stats
model_coef <- tidy(multi_reg_model)
model_glance <- glance(multi_reg_model)
model_coef
## # A tibble: 4 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    465.        61.3        7.58  1.35e-12
## 2 Sales           -0.00781    0.0240    -0.326 7.45e- 1
## 3 Lead_Time_Days   7.32       5.29       1.38  1.68e- 1
## 4 Price           -1.09       2.31      -0.472 6.38e- 1
model_glance
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1    0.0122      -0.00289  155.     0.809   0.490     3 -1291. 2592. 2608.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

I fit a multiple linear model predicting Inventory_Levels from Sales, Lead_Time_Days, and Price (coefficients and 95% CIs are shown). The R-squared is very low, so these predictors explain almost none of the variance in inventory. Coefficients have high uncertainty, suggesting other factors like seasonality or product-level differences are likely more important.

# Quick residual diagnostics
par(mfrow = c(1,2))
plot(multi_reg_model, which = 1) # Residuals vs Fitted
plot(multi_reg_model, which = 2) # Normal Q-Q

par(mfrow = c(1,1))

# Numerical diagnostics
resid_sd <- sigma(multi_reg_model)
vif_vals <- vif(multi_reg_model)
shapiro_p <- shapiro.test(residuals(multi_reg_model))$p.value

# Results
list(residual_sd = resid_sd, VIF = vif_vals, shapiro_p = shapiro_p)
## $residual_sd
## [1] 155.2702
## 
## $VIF
##          Sales Lead_Time_Days          Price 
##       1.017562       1.008633       1.011822 
## 
## $shapiro_p
## [1] 0.4239664

I checked residuals (residual SD ≈ 155), normality (Shapiro p ≈ 0.424), and plotted Residuals vs Fitted and a Q–Q plot. VIFs are ~1, so multicollinearity is not an issue, and no strong skew in residuals was detected, but formal tests (Breusch–Pagan, Durbin–Watson) should be used if time series or grouped structure exists. Overall, plots show no severe violations, but the model fit is weak, so predictions should be interpreted cautiously.

2. Optimization:

# Example peak scenarios
scenarios <- tibble(
  Sales = c(1000, 1500),
  Lead_Time_Days = c(5, 7),
  Price = c(20, 25)
)

# Predict with 95% prediction interval
pred_tbl <- predict(multi_reg_model, newdata = scenarios, interval = "prediction", level = 0.95)
pred_tbl <- as_tibble(pred_tbl) |> setNames(c("Fit", "Lower", "Upper"))

# Bind predictions and compute upper-bound inventory
scenarios <- scenarios |> 
  bind_cols(pred_tbl) |> 
  mutate(Optimized_Inventory = round(Upper, 1))

# View results
scenarios
## # A tibble: 2 × 7
##   Sales Lead_Time_Days Price   Fit Lower Upper Optimized_Inventory
##   <dbl>          <dbl> <dbl> <dbl> <dbl> <dbl>               <dbl>
## 1  1000              5    20  472.  164.  780.                780.
## 2  1500              7    25  477.  167.  787.                788.

Using the multiple regression model, I generated predictions for two peak sales scenarios. The 95% prediction interval accounts for uncertainty in inventory levels. The upper bound of the interval is used to recommend “optimized” inventory to ensure sufficient stock. This provides a baseline, though additional factors like seasonality or product-specific demand should also be considered.

# Summary table
list(
  Coefficients = model_coef,
  Model_Summary = model_glance,
  Diagnostics = list(residual_sd = resid_sd, VIF = vif_vals, shapiro_p = shapiro_p),
  Example_Optimized = scenarios
)
## $Coefficients
## # A tibble: 4 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    465.        61.3        7.58  1.35e-12
## 2 Sales           -0.00781    0.0240    -0.326 7.45e- 1
## 3 Lead_Time_Days   7.32       5.29       1.38  1.68e- 1
## 4 Price           -1.09       2.31      -0.472 6.38e- 1
## 
## $Model_Summary
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1    0.0122      -0.00289  155.     0.809   0.490     3 -1291. 2592. 2608.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## 
## $Diagnostics
## $Diagnostics$residual_sd
## [1] 155.2702
## 
## $Diagnostics$VIF
##          Sales Lead_Time_Days          Price 
##       1.017562       1.008633       1.011822 
## 
## $Diagnostics$shapiro_p
## [1] 0.4239664
## 
## 
## $Example_Optimized
## # A tibble: 2 × 7
##   Sales Lead_Time_Days Price   Fit Lower Upper Optimized_Inventory
##   <dbl>          <dbl> <dbl> <dbl> <dbl> <dbl>               <dbl>
## 1  1000              5    20  472.  164.  780.                780.
## 2  1500              7    25  477.  167.  787.                788.

The regression explains almost none of the inventory variance and no predictor is significant. Multicollinearity is low, but missing variables likely drive inventory more than Sales, Price, or Lead Time. The prediction intervals from the model give a principled baseline, but final inventory levels should rely on contextual adjustments and business rules.

Conclusion

This analysis explored retail sales, inventory levels, lead times, and pricing to support data-driven decision-making. I fitted theoretical distributions to key variables, showing that Sales are best modeled by a Gamma distribution, Inventory Levels by a Lognormal distribution, and Lead Times by a Normal distribution. Empirical and theoretical statistics matched closely, confirming the appropriateness of these models.

Correlation and independence tests revealed weak relationships between Sales, Price, and Inventory Levels, suggesting that pricing alone does not drive sales, and inventory decisions depend on additional factors like seasonality or demand patterns. Multiple regression for inventory prediction showed low explanatory power, indicating that other variables not captured here are likely more important. Prediction intervals provide a practical baseline for inventory planning under uncertainty, but business context and operational rules remain critical.

Overall, the project demonstrates how statistical modeling and regression analysis can inform retail decisions, identify key drivers of variability, and guide inventory optimization, while highlighting the limitations of relying on a small set of predictors.