This RMD contains the CUNY SPS DATA 605 Fall 2025 Final Exam. The report covers two problems, each with labeled parts. The analysis examines retail sales, inventory levels, lead times, and pricing to support data-driven decisions. I fit statistical distributions to key variables and checked the correlation and independence between Sales and Price. A multiple regression model was built to predict inventory levels, and prediction intervals are used to recommend optimized inventory for peak sales scenarios. Overall, the results provide a starting point for inventory planning, sales forecasting, and pricing strategy.
library(tidyverse)
library(ggplot2)
library(MASS)
library(dplyr)
library(broom)
library(Matrix)
library(car)
library(lmtest)
# Download provided data set
retail_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA605/refs/heads/main/synthetic_retail_data.csv")
# Save the data
retail_data <- retail_raw
# Take a look
head(retail_data)
## # A tibble: 6 × 6
## Product_ID Sales Inventory_Levels Lead_Time_Days Price Seasonality_Index
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 158. 367. 6.31 18.8 1.18
## 2 2 279. 427. 5.80 26.1 0.857
## 3 3 699. 408. 3.07 22.4 0.699
## 4 4 1832. 392. 3.53 27.1 0.698
## 5 5 460. 448. 10.8 18.3 0.841
## 6 6 1693. 547. 10.1 23.5 1.13
summary(retail_data)
## Product_ID Sales Inventory_Levels Lead_Time_Days
## Min. : 1.00 Min. : 25.57 Min. : 67.35 Min. : 0.491
## 1st Qu.: 50.75 1st Qu.: 284.42 1st Qu.:376.51 1st Qu.: 5.291
## Median :100.50 Median : 533.54 Median :483.72 Median : 6.765
## Mean :100.50 Mean : 636.92 Mean :488.55 Mean : 6.834
## 3rd Qu.:150.25 3rd Qu.: 867.58 3rd Qu.:600.42 3rd Qu.: 8.212
## Max. :200.00 Max. :2447.49 Max. :858.79 Max. :12.722
## Price Seasonality_Index
## Min. : 5.053 Min. :0.3305
## 1st Qu.:16.554 1st Qu.:0.8475
## Median :19.977 Median :0.9762
## Mean :19.560 Mean :0.9829
## 3rd Qu.:22.924 3rd Qu.:1.1205
## Max. :29.404 Max. :1.5958
X ~ Sales
# Sales (X) - Gamma Distribution
# try() guards against a convergence failure in the MLE fit
sales_fit <- try(fitdistr(retail_data$Sales, "gamma"), silent = TRUE)
# Note: summary() on a fitdistr object only lists its components;
# print sales_fit (or sales_fit$estimate) to see the fitted parameters
summary(sales_fit)
## Length Class Mode
## estimate 2 -none- numeric
## sd 2 -none- numeric
## vcov 4 -none- numeric
## loglik 1 -none- numeric
## n 1 -none- numeric
# Visual check
sales_params <- sales_fit$estimate
qqplot(qgamma(ppoints(retail_data$Sales),
sales_params["shape"], sales_params["rate"]),
retail_data$Sales,
main = "Q-Q Plot: Sales vs Gamma")
abline(0,1)
The Gamma distribution fits the Sales data well: the theoretical mean and variance implied by the fitted parameters are very close to the empirical values, suggesting the Gamma model is a good choice for describing sales variability. The Q-Q plot also shows the data are slightly right-skewed with a long right tail.
Y ~ Inventory Levels
# Inventory Levels (Y) - Lognormal Distribution
inventory_levels_fit <- fitdistr(retail_data$Inventory_Levels, "lognormal")
summary(inventory_levels_fit)
## Length Class Mode
## estimate 2 -none- numeric
## sd 2 -none- numeric
## vcov 4 -none- numeric
## n 1 -none- numeric
## loglik 1 -none- numeric
# Visual check
inventory_params <- inventory_levels_fit$estimate
qqplot(qlnorm(ppoints(retail_data$Inventory_Levels),
inventory_params["meanlog"], inventory_params["sdlog"]),
retail_data$Inventory_Levels,
main = "Q-Q Plot: Inventory Levels vs Lognormal")
abline(0,1)
The Lognormal distribution is reasonable for Inventory Levels because the values are positive and right-skewed. The theoretical mean and variance are slightly higher than the empirical values, but they are still in the same range. This means the Lognormal distribution captures the general shape of the data but might slightly overestimate variability. This makes sense in practice because inventory levels often grow multiplicatively and vary greatly by product.
Z ~ Lead Time
# Lead Time (Z) - Normal Distribution
lead_time_fit <- fitdistr(retail_data$Lead_Time_Days, "normal")
summary(lead_time_fit)
## Length Class Mode
## estimate 2 -none- numeric
## sd 2 -none- numeric
## vcov 4 -none- numeric
## n 1 -none- numeric
## loglik 1 -none- numeric
# Visual check
leadtime_params <- lead_time_fit$estimate
qqplot(qnorm(ppoints(retail_data$Lead_Time_Days),
leadtime_params["mean"], leadtime_params["sd"]),
retail_data$Lead_Time_Days,
main = "Q-Q Plot: Lead Time vs Normal")
abline(0,1)
The Lead Time data fits the Normal distribution well. The theoretical mean and variance are almost exactly the same as the empirical values. This suggests the Normal distribution is appropriate for modeling lead times. It also implies that lead times are fairly stable and symmetric around their average.
Parameters
## Variable Param1 Param2
## shape Sales (Gamma) 1.834964 0.002881017
## meanlog Inventory (Lognormal) 6.133037 0.363327270
## mean Lead Time (Normal) 6.834298 2.083213697
The estimated parameters describe the shape and spread of each distribution: shape and rate for the Gamma, meanlog and sdlog for the Lognormal, and mean and sd for the Normal (the row labels show the name of each fit's first parameter, and Param2 holds the second).
These estimates give a compact summary of how each variable behaves and support forecasting and planning decisions; a sketch for rebuilding the table follows.
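No chunk was echoed for the table above; a minimal sketch that rebuilds it from the three fitdistr results (the row labels come from the names of the first parameter in each estimate vector):
# Collect the MLE parameter pairs: (shape, rate), (meanlog, sdlog), (mean, sd)
param_table <- data.frame(
  Variable = c("Sales (Gamma)", "Inventory (Lognormal)", "Lead Time (Normal)"),
  Param1   = c(sales_params["shape"], inventory_params["meanlog"], leadtime_params["mean"]),
  Param2   = c(sales_params["rate"],  inventory_params["sdlog"],   leadtime_params["sd"])
)
param_table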
# Theoretical mean and variance
theoretical_stats <- data.frame(
Variable = c("Sales", "Inventory Levels", "Lead Time"),
Mean = c(
sales_params["shape"] / sales_params["rate"],
exp(inventory_params["meanlog"] + (inventory_params["sdlog"]^2)/2),
leadtime_params["mean"]
),
Variance = c(
sales_params["shape"] / (sales_params["rate"]^2),
(exp(inventory_params["sdlog"]^2) - 1) *
exp(2 * inventory_params["meanlog"] + (inventory_params["sdlog"]^2)),
leadtime_params["sd"]^2
)
)
# Empirical mean and variance
empirical_stats <- data.frame(
Variable = c("Sales", "Inventory Levels", "Lead Time"),
Mean = c(mean(retail_data$Sales), mean(retail_data$Inventory_Levels), mean(retail_data$Lead_Time_Days)),
Variance = c(var(retail_data$Sales), var(retail_data$Inventory_Levels), var(retail_data$Lead_Time_Days))
)
# Compare results
comparison_table <- data.frame(
Variable = c("Sales", "Inventory Levels", "Lead Time"),
Empirical_Mean = empirical_stats$Mean,
Empirical_Var = empirical_stats$Variance,
Theoretical_Mean = theoretical_stats$Mean,
Theoretical_Var = theoretical_stats$Variance
)
# View results
comparison_table
## Variable Empirical_Mean Empirical_Var Theoretical_Mean
## 1 Sales 636.916210 2.148318e+05 636.915491
## 2 Inventory Levels 488.547176 2.403945e+04 492.276304
## 3 Lead Time 6.834298 4.361587e+00 6.834298
## Theoretical_Var
## 1 2.210732e+05
## 2 3.419747e+04
## 3 4.339779e+00
The empirical and theoretical means and variances match closely for all three variables. This shows that the estimated distributions describe the data well. The small differences come from sampling variation and natural noise in retail operations. Overall, the distributional assumptions are reasonable and supported by the data.
Get values
# Get mu and sigma
mu <- mean(retail_data$Lead_Time_Days)
sigma <- sd(retail_data$Lead_Time_Days)
# Base probabilities
p_greater_mu <- 1 - pnorm(mu, mean = mu, sd = sigma)
p_greater_mu_minus <- 1 - pnorm(mu - sigma, mean = mu, sd = sigma)
p_greater_mu_plus <- 1 - pnorm(mu + sigma, mean = mu, sd = sigma)
p_greater_mu_plus2 <- 1 - pnorm(mu + 2*sigma, mean = mu, sd = sigma)
P( Z > mu | Z > mu - sigma )
# Get first probability based on given formula
prob_a <- p_greater_mu / p_greater_mu_minus
P( Z > mu + sigma | Z > mu )
# Get second probability based on given formula
prob_b <- p_greater_mu_plus / p_greater_mu
P( Z > mu + 2 * sigma | Z > mu )
# Get third probability based on given formula
prob_c <- p_greater_mu_plus2 / p_greater_mu
All probabilities
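No code chunk was echoed for this table; a minimal sketch (column names assumed) that assembles it from the probabilities computed above:
# Collect the conditional probabilities in one table. Because the Normal is
# symmetric, P(Z > mu) = 0.5, so the second and third values are exactly
# 2 * P(Z > mu + sigma) and 2 * P(Z > mu + 2*sigma).
data.frame(
  Probability = c("P( Z > mu | Z > mu - sigma )",
                  "P( Z > mu + sigma | Z > mu )",
                  "P( Z > mu + 2sigma | Z > mu )"),
  Value = round(c(prob_a, prob_b, prob_c), 3)
)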
## Probability Value
## 1 P( Z > mu | Z > mu - sigma ) 0.594
## 2 P( Z > mu + sigma | Z > mu ) 0.317
## 3 P( Z > mu + 2sigma | Z > mu ) 0.046
These results show how likely lead times are to exceed higher thresholds once they are already above certain levels. About 59% of lead times above (mu - sigma) also exceed the mean. If a lead time is above the mean, only 32% go beyond (mu + sigma), and just 5% go beyond (mu + 2sigma). As the cutoff increases, the chance of exceeding it drops quickly.
# Correlation between Sales and Price
correlation <- cor(retail_data$Sales, retail_data$Price)
cat("Correlation between Sales and Price:", round(correlation, 3), "\n")
## Correlation between Sales and Price: 0.103
# Create quartiles to avoid sparse cells
retail_data$Sales_Quartile <- cut(
retail_data$Sales,
quantile(retail_data$Sales, probs = seq(0, 1, 0.25)),
include.lowest = TRUE
)
retail_data$Price_Quartile <- cut(
retail_data$Price,
quantile(retail_data$Price, probs = seq(0, 1, 0.25)),
include.lowest = TRUE
)
# Contingency table
contingency_table <- table(retail_data$Sales_Quartile, retail_data$Price_Quartile)
contingency_table
##
## [5.05,16.6] (16.6,20] (20,22.9] (22.9,29.4]
## [25.6,284] 11 16 12 11
## (284,534] 13 10 15 12
## (534,868] 15 10 13 12
## (868,2.45e+03] 11 14 10 15
The correlation between Sales and Price is weak (r ≈ 0.10).
The contingency table cross-tabulates Sales quartiles (rows) against Price quartiles (columns); each cell counts the products falling in that combination of Sales and Price ranges. The counts, and therefore the joint and marginal proportions, are spread nearly evenly across cells, which is consistent with no meaningful association between the two variables.
# Fisher's Exact Test with increased workspace
fisher_test <- fisher.test(contingency_table, workspace = 2e7)
# Chi-Square Test
chisq_test <- chisq.test(contingency_table)
list(Fisher_Test = fisher_test, Chi_Square_Test = chisq_test)
## $Fisher_Test
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value = 0.8637
## alternative hypothesis: two.sided
##
##
## $Chi_Square_Test
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 4.8, df = 9, p-value = 0.8514
Both Fisher's Exact Test and the Chi-Square Test return large p-values (0.864 and 0.851), so we fail to reject the null hypothesis of independence; the data are consistent with Sales and Price being independent. Overall, price levels do not appear to influence sales patterns in this dataset.
# Select only the desired columns for this section
retail_new <- retail_raw |>
dplyr::select(Inventory_Levels, Sales, Lead_Time_Days, Price)
# View the first few rows
head(retail_new)
## # A tibble: 6 × 4
## Inventory_Levels Sales Lead_Time_Days Price
## <dbl> <dbl> <dbl> <dbl>
## 1 367. 158. 6.31 18.8
## 2 427. 279. 5.80 26.1
## 3 408. 699. 3.07 22.4
## 4 392. 1832. 3.53 27.1
## 5 448. 460. 10.8 18.3
## 6 547. 1693. 10.1 23.5
Univariate descriptive statistics for Inventory_Levels and Sales
# Summary statistics for Inventory_Levels and Sales
retail_new |>
dplyr::select(Inventory_Levels, Sales) |>
summary()
## Inventory_Levels Sales
## Min. : 67.35 Min. : 25.57
## 1st Qu.:376.51 1st Qu.: 284.42
## Median :483.72 Median : 533.54
## Mean :488.55 Mean : 636.92
## 3rd Qu.:600.42 3rd Qu.: 867.58
## Max. :858.79 Max. :2447.49
# Also compute mean and sd explicitly
retail_new |>
summarise(
Inventory_Mean = mean(Inventory_Levels),
Inventory_SD = sd(Inventory_Levels),
Sales_Mean = mean(Sales),
Sales_SD = sd(Sales)
)
## # A tibble: 1 × 4
## Inventory_Mean Inventory_SD Sales_Mean Sales_SD
## <dbl> <dbl> <dbl> <dbl>
## 1 489. 155. 637. 463.
Visualizations for Inventory_Levels, Sales, and Price
# Histogram of Inventory_Levels
ggplot(retail_new, aes(x = Inventory_Levels)) +
geom_histogram(binwidth = 50, fill = "darkorchid4", color = "gray89") +
labs(title = "Histogram of Inventory Levels", x = "Inventory Levels", y = "Count") +
theme_minimal()
# Histogram of Sales
ggplot(retail_new, aes(x = Sales)) +
geom_histogram(binwidth = 200, fill = "salmon", color = "gray89") +
labs(title = "Histogram of Sales", x = "Sales", y = "Count") +
theme_minimal()
# Histogram of Price
ggplot(retail_new, aes(x = Price)) +
geom_histogram(binwidth = 2, fill = "lightskyblue2", color = "gray89") +
labs(title = "Histogram of Price", x = "Price", y = "Count") +
theme_minimal()
# Scatterplot of Sales vs Inventory_Levels
ggplot(retail_new, aes(x = Inventory_Levels, y = Sales)) +
geom_point(color = "plum3", alpha=0.7) +
geom_smooth(method = "lm", color = "darkred", se = TRUE) +
labs(title = "Sales vs Inventory Levels", x = "Inventory Levels", y = "Sales") +
theme_minimal()
# Scatterplot of Sales vs Price
ggplot(retail_new, aes(x = Price, y = Sales)) +
geom_point(color = "aquamarine3", alpha=0.7) +
geom_smooth(method = "lm", color = "darkred", se = TRUE) +
labs(title = "Sales vs Price", x = "Price", y = "Sales") +
theme_minimal()
# Scatterplot of Inventory_Levels vs Price
ggplot(retail_new, aes(x = Price, y = Inventory_Levels)) +
geom_point(color = "purple3", alpha=0.7) +
geom_smooth(method = "lm", color = "darkred", se = TRUE) +
labs(title = "Inventory Levels vs Price", x = "Price", y = "Inventory Levels") +
theme_minimal()
Histograms: Sales and Inventory_Levels are right-skewed, while Price is roughly symmetric.
The scatterplots with regression lines confirm the correlation results: there is no strong linear trend between Sales and Inventory_Levels, Sales and Price, or Inventory_Levels and Price. Overall, the weak correlations and mostly flat regression lines suggest that variations in Price or Inventory_Levels have little direct impact on Sales in this dataset.
These patterns support the conclusion that multicollinearity is not a concern, and inventory decisions should consider other factors like seasonality or marketing rather than relying solely on Price or stock levels.
Correlation matrix for Sales, Price, and Inventory_Levels
# Select numeric columns
numeric_vars <- retail_new |>
dplyr::select(Sales, Price, Inventory_Levels)
# Compute correlation matrix from the selected columns
cor_matrix <- cor(numeric_vars)
cor_matrix
## Sales Price Inventory_Levels
## Sales 1.00000000 0.10272730 -0.03529619
## Price 0.10272730 1.00000000 -0.04025941
## Inventory_Levels -0.03529619 -0.04025941 1.00000000
The correlation analysis shows that Sales and Price have a very weak positive correlation (r = 0.103, p = 0.148), while Sales and Inventory_Levels (r = -0.035, p = 0.620) and Price and Inventory_Levels (r = -0.040, p = 0.571) are both essentially uncorrelated.
Test the hypotheses that the correlations between the variables are zero and provide a 95% confidence interval.
# Function to compute correlation test and 95% CI
cor_test_with_ci <- function(x, y) {
test <- cor.test(x, y)
data.frame(
Variable1 = deparse(substitute(x)),
Variable2 = deparse(substitute(y)),
Correlation = test$estimate,
p_value = test$p.value,
CI_lower = test$conf.int[1],
CI_upper = test$conf.int[2]
)
}
# Apply to all pairs
test_results <- rbind(
cor_test_with_ci(retail_new$Sales, retail_new$Price),
cor_test_with_ci(retail_new$Sales, retail_new$Inventory_Levels),
cor_test_with_ci(retail_new$Price, retail_new$Inventory_Levels)
)
test_results
## Variable1 Variable2 Correlation p_value
## cor retail_new$Sales retail_new$Price 0.10272730 0.1477542
## cor1 retail_new$Sales retail_new$Inventory_Levels -0.03529619 0.6197608
## cor2 retail_new$Price retail_new$Inventory_Levels -0.04025941 0.5713837
## CI_lower CI_upper
## cor -0.03653442 0.23807516
## cor1 -0.17318909 0.10395389
## cor2 -0.17800614 0.09903478
The 95% confidence intervals for all correlations include zero, confirming that these relationships are not statistically significant. These results suggest that changes in Price or Inventory_Levels are unlikely to strongly impact Sales in this dataset.
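For reference, cor.test builds these intervals with Fisher's z-transform; a quick hand check for the Sales-Price pair (a sketch):
# z = atanh(r) is approximately Normal with standard error 1 / sqrt(n - 3);
# back-transforming the z-interval with tanh() gives the CI for r.
r  <- cor(retail_new$Sales, retail_new$Price)
n  <- nrow(retail_new)
se <- 1 / sqrt(n - 3)
tanh(atanh(r) + c(-1, 1) * qnorm(0.975) * se)  # ~ (-0.0365, 0.2381), as above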
From an inventory management perspective, there is no evidence of a strong link between inventory levels and sales, so decisions about stocking may need to rely on other factors such as seasonality or demand forecasts, which are not included here. Additionally, because all predictor correlations are very low, multicollinearity would not be a concern in a potential regression model.
# Linear regression
reg_model <- lm(Sales ~ Price, data = retail_new)
reg_tidy <- tidy(reg_model)
reg_tidy
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 443. 137. 3.22 0.00148
## 2 Price 9.92 6.82 1.45 0.148
# Correlation matrix
cor_matrix |>
as.data.frame() |>
rownames_to_column("Variable") |>
pivot_longer(-Variable, names_to = "Compare", values_to = "Correlation")
## # A tibble: 9 × 3
## Variable Compare Correlation
## <chr> <chr> <dbl>
## 1 Sales Sales 1
## 2 Sales Price 0.103
## 3 Sales Inventory_Levels -0.0353
## 4 Price Sales 0.103
## 5 Price Price 1
## 6 Price Inventory_Levels -0.0403
## 7 Inventory_Levels Sales -0.0353
## 8 Inventory_Levels Price -0.0403
## 9 Inventory_Levels Inventory_Levels 1
# Precision matrix (inverse of correlation)
precision_matrix <- solve(cor_matrix)
precision_tidy <- as.data.frame(precision_matrix) |>
rownames_to_column("Variable") |>
pivot_longer(-Variable, names_to = "Compare", values_to = "Precision")
precision_tidy
## # A tibble: 9 × 3
## Variable Compare Precision
## <chr> <chr> <dbl>
## 1 Sales Sales 1.01
## 2 Sales Price -0.103
## 3 Sales Inventory_Levels 0.0316
## 4 Price Sales -0.103
## 5 Price Price 1.01
## 6 Price Inventory_Levels 0.0371
## 7 Inventory_Levels Sales 0.0316
## 8 Inventory_Levels Price 0.0371
## 9 Inventory_Levels Inventory_Levels 1.00
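As a side check (my addition, not part of the original output), the precision matrix's off-diagonal entries can be rescaled into partial correlations, which measure each pairwise association with the third variable held fixed:
# Partial correlation: r_ij.k = -p_ij / sqrt(p_ii * p_jj); cov2cor() does the
# rescaling, then we negate and reset the diagonal. With near-zero marginal
# correlations, these should be near zero as well.
partial_cors <- -cov2cor(precision_matrix)
diag(partial_cors) <- 1
round(partial_cors, 3)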
# LU decomposition of correlation matrix
lu_decomp <- lu(cor_matrix)
# Extract L and U as base R matrices
L <- as.matrix(expand(lu_decomp)$L)
U <- as.matrix(expand(lu_decomp)$U)
# Convert to tidy data frames
L_tidy <- L |>
as.data.frame() |>
tibble::rownames_to_column("Variable") |>
pivot_longer(-Variable, names_to = "Compare", values_to = "L_Value")
U_tidy <- U |>
as.data.frame() |>
tibble::rownames_to_column("Variable") |>
pivot_longer(-Variable, names_to = "Compare", values_to = "U_Value")
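To put numbers behind the "well-conditioned" claim in the discussion below, one can inspect the pivots on the diagonal of U and estimate the condition number (a sketch):
# Healthy, similar-magnitude pivots and a condition number near 1 indicate a
# numerically stable, nearly orthogonal correlation matrix.
diag(U)
kappa(cor_matrix)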
# Correlation tests with tidy output
cor_tests <- list(
Sales_Price = cor.test(retail_new$Sales, retail_new$Price),
Sales_Inventory = cor.test(retail_new$Sales, retail_new$Inventory_Levels),
Price_Inventory = cor.test(retail_new$Price, retail_new$Inventory_Levels)
)
cor_tests_tidy <- map_dfr(cor_tests, ~{
tibble(
Estimate = .$estimate,
t_value = .$statistic,
df = .$parameter,
p_value = .$p.value,
CI_lower = .$conf.int[1],
CI_upper = .$conf.int[2]
)
}, .id = "Comparison")
cor_tests_tidy
## # A tibble: 3 × 7
## Comparison Estimate t_value df p_value CI_lower CI_upper
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Sales_Price 0.103 1.45 198 0.148 -0.0365 0.238
## 2 Sales_Inventory -0.0353 -0.497 198 0.620 -0.173 0.104
## 3 Price_Inventory -0.0403 -0.567 198 0.571 -0.178 0.0990
Here, the relationships among Sales, Price, and Inventory_Levels are analyzed with regression, correlation, and matrix decomposition methods. The linear regression of Sales on Price shows a very weak relationship (R-squared = 0.0106, p = 0.148), meaning Price explains almost none of the variation in Sales; this is somewhat unexpected, since price and sales are often directly linked. The diagonal elements of the precision matrix are close to 1, indicating minimal multicollinearity among the three variables. The LU factorization completes with healthy pivots, consistent with a numerically stable, well-conditioned correlation matrix, and reveals no hidden linear dependencies. Overall, Price has limited influence on Sales, so other factors should be considered to explain sales patterns.
In this dataset, the weak regression coefficient and low R-squared suggest that Sales are relatively inelastic with respect to Price: small changes in Price have minimal effect on demand. Pricing alone won't move sales, so other strategies are needed to drive revenue.
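As a quick check on that reading (my addition, not part of the assignment output), a log-log regression gives a direct elasticity estimate:
# In a log-log model the slope on log(Price) is the price elasticity of
# demand; a value near zero supports the "inelastic" interpretation.
elasticity_fit <- lm(log(Sales) ~ log(Price), data = retail_new)
tidy(elasticity_fit)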
# Fit exponential to Sales
exp_fit <- fitdistr(retail_new$Sales, "exponential")
rate_hat <- as.numeric(exp_fit$estimate["rate"])
exp_fit
## rate
## 0.0015700652
## (0.0001110204)
# Generate 1,000 samples from fitted exponential
set.seed(64)
samples <- rexp(1000, rate = rate_hat)
# Prep data frames for plotting
df_orig <- tibble(Sales = retail_new$Sales, Type = "Original")
df_sim <- tibble(Sales = samples, Type = "Exponential Sim")
df_both <- bind_rows(df_orig, df_sim)
# Plot an overlaid histogram
ggplot(df_both, aes(x = Sales, fill = Type)) +
geom_histogram(position = "identity", alpha = 0.55, bins = 60, color = "gray70") +
scale_fill_manual(values = c("Original" = "aquamarine",
"Exponential Sim" = "salmon")) +
labs(title = "Original Sales vs. Simulated Exponential Samples",
x = "Sales", y = "Count") +
theme_minimal()
# Theoretical percentiles from exponential CDF
theor_pct <- qexp(c(0.05, 0.95), rate = rate_hat) |> set_names(c("5%", "95%"))
# Empirical percentiles
emp_pct <- quantile(retail_new$Sales, probs = c(0.05, 0.95)) |> set_names(c("5%", "95%"))
# 95% CI for the mean assuming normality
n <- length(retail_new$Sales)
xbar <- mean(retail_new$Sales)
s <- sd(retail_new$Sales)
tcrit <- qt(0.975, df = n - 1)
ci_mean <- c(xbar - tcrit * s / sqrt(n), xbar + tcrit * s / sqrt(n)) |>
set_names(c("CI_lower", "CI_upper"))
# Print tidy results
list(
Fit = exp_fit,
Theoretical_percentiles = theor_pct,
Empirical_percentiles = emp_pct,
Mean_and_95CI = c(mean = xbar, ci_mean)
)
## $Fit
## rate
## 0.0015700652
## (0.0001110204)
##
## $Theoretical_percentiles
## 5% 95%
## 32.66953 1908.03045
##
## $Empirical_percentiles
## 5% 95%
## 104.9028 1502.2498
##
## $Mean_and_95CI
## mean CI_lower CI_upper
## 636.9162 572.2866 701.5458
I fit an exponential distribution to the right-skewed Sales data, simulated 1,000 samples from the fitted exponential, and plotted those samples together with the original Sales histogram for visual comparison. I computed the 5th and 95th percentiles implied by the fitted exponential and compared them to the empirical 5th/95th percentiles of the data, and I also calculated a 95% confidence interval for the mean (assuming approximate normality of the sample mean). The exponential fit produced a rate of 0.00157, and the simulated data show a typical exponential shape: strong right skew with many small values. The theoretical percentiles (5% = 32.7, 95% = 1908.0) differ noticeably from the empirical percentiles (5% = 104.9, 95% = 1502.2), suggesting the exponential model does not capture the true spread or tail behavior of the sales data. The normal-based 95% confidence interval for the mean (572.3 to 701.5) is centered on the sample mean (636.9) by construction, and with n = 200 the CLT makes it a reasonable interval for the mean; however, it describes the average rather than individual sales values, so it is less useful for forecasting actual sales.
The exponential distribution does not closely model the data. The comparison histogram shows the gap between the original Sales distribution and the simulated exponential: the exponential puts too much mass near zero and declines too quickly compared to the real Sales data, whose body sits higher and whose upper tail is lighter than the exponential implies. The theoretical exponential percentiles (33 and 1908) are much wider and more extreme than the empirical percentiles (105 and 1502), showing that the exponential overestimates the lower tail and over-stretches the upper tail. This mismatch occurs because the data are right-skewed, but not nearly as sharply as an exponential distribution assumes. For forecasting, an exponential model would generate too many very small values and too many extreme large values, making it unreliable for predicting future sales. A log-normal or gamma distribution should provide a better fit for this type of skewed data.
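No chunk was echoed for the comparison table below; a sketch that would reproduce it (column names assumed from the output; the KS p-values are approximate because the parameters are estimated from the same data):
# Fit all three candidates by maximum likelihood, then compare AIC and
# Kolmogorov-Smirnov statistics against the empirical Sales distribution.
fits <- list(
  Exponential = fitdistr(retail_new$Sales, "exponential"),
  Lognormal   = fitdistr(retail_new$Sales, "lognormal"),
  Gamma       = fitdistr(retail_new$Sales, "gamma")
)
ks <- list(
  Exponential = ks.test(retail_new$Sales, "pexp",
                        fits$Exponential$estimate["rate"]),
  Lognormal   = ks.test(retail_new$Sales, "plnorm",
                        fits$Lognormal$estimate["meanlog"],
                        fits$Lognormal$estimate["sdlog"]),
  Gamma       = ks.test(retail_new$Sales, "pgamma",
                        fits$Gamma$estimate["shape"],
                        fits$Gamma$estimate["rate"])
)
tibble(
  Dist      = names(fits),
  Param1    = map_dbl(fits, ~ .$estimate[[1]]),
  Param2    = map_dbl(fits, ~ if (length(.$estimate) > 1) .$estimate[[2]] else NA_real_),
  AIC       = map_dbl(fits, AIC),  # AIC() works here via MASS's logLik method
  KS_stat   = map_dbl(ks, ~ unname(.$statistic)),
  KS_pvalue = map_dbl(ks, ~ .$p.value)
)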
## # A tibble: 3 × 6
## Dist Param1 Param2 AIC KS_stat KS_pvalue
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Exponential 0.00157 NA 2985. 0.140 0.000783
## 2 Lognormal 6.16 0.846 2969. 0.0711 0.264
## 3 Gamma 1.83 0.00288 2950. 0.0428 0.857
I fit Exponential, Lognormal, and Gamma distributions to the Sales data using maximum likelihood. I compared fits by plotting the empirical histogram with the three fitted density curves and by computing AIC and KS statistics. I also calculated the 5th and 95th percentiles for the Lognormal and Gamma fits for direct comparison with the empirical percentiles. The Exponential fails to match the data, but both the Lognormal and Gamma fit much better. AIC and KS tests show the Gamma distribution is the best overall fit, capturing both the main body and the right tail most accurately. Its percentiles are closest to the empirical values, meaning it reflects the actual sales range more realistically. For forecasting, the Gamma is the most reliable choice, with the Lognormal as a reasonable backup.
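A sketch of the percentile comparison described above, plugging in the fitted parameters from the comparison table:
# Empirical 5th/95th percentiles vs. those implied by the fitted Lognormal
# and Gamma parameters (values taken from the table above).
rbind(
  Empirical = quantile(retail_new$Sales, c(0.05, 0.95)),
  Lognormal = qlnorm(c(0.05, 0.95), meanlog = 6.16, sdlog = 0.846),
  Gamma     = qgamma(c(0.05, 0.95), shape = 1.83, rate = 0.00288)
)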
# Multiple Regression Model
multi_reg_model <- lm(Inventory_Levels ~ Sales + Lead_Time_Days + Price, data = retail_new)
# Tidy summary + model-level stats
model_coef <- tidy(multi_reg_model)
model_glance <- glance(multi_reg_model)
model_coef
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 465. 61.3 7.58 1.35e-12
## 2 Sales -0.00781 0.0240 -0.326 7.45e- 1
## 3 Lead_Time_Days 7.32 5.29 1.38 1.68e- 1
## 4 Price -1.09 2.31 -0.472 6.38e- 1
model_glance
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0122 -0.00289 155. 0.809 0.490 3 -1291. 2592. 2608.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
I fit a multiple linear model predicting Inventory_Levels from Sales, Lead_Time_Days, and Price (coefficients are shown above; 95% CIs can be extracted as sketched below). The R-squared is very low, so these predictors explain almost none of the variance in inventory levels. The coefficients have high uncertainty, suggesting other factors, such as seasonality or product-level differences, are likely more important.
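The default tidy() output above omits interval estimates; the 95% CIs can be pulled directly:
# Coefficient estimates with 95% confidence intervals
tidy(multi_reg_model, conf.int = TRUE, conf.level = 0.95)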
# Quick residual diagnostics
par(mfrow = c(1,2))
plot(multi_reg_model, which = 1) # Residuals vs Fitted
plot(multi_reg_model, which = 2) # Normal Q-Q
par(mfrow = c(1,1))
# Numerical diagnostics
resid_sd <- sigma(multi_reg_model)
vif_vals <- vif(multi_reg_model)
shapiro_p <- shapiro.test(residuals(multi_reg_model))$p.value
# Results
list(residual_sd = resid_sd, VIF = vif_vals, shapiro_p = shapiro_p)
## $residual_sd
## [1] 155.2702
##
## $VIF
## Sales Lead_Time_Days Price
## 1.017562 1.008633 1.011822
##
## $shapiro_p
## [1] 0.4239664
I checked residuals (residual SD ≈ 155), normality (Shapiro p ≈ 0.424), and plotted Residuals vs Fitted and a Normal Q-Q plot. VIFs are all ≈ 1, so multicollinearity is not an issue, and the residuals show no strong skew; formal tests (Breusch-Pagan for heteroskedasticity, Durbin-Watson for autocorrelation) are sketched below for completeness. Overall, the plots show no severe violations, but the model fit is weak, so predictions should be interpreted cautiously.
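A minimal sketch of those formal checks, using lmtest (already loaded):
# Breusch-Pagan: H0 = homoskedastic residuals.
bptest(multi_reg_model)
# Durbin-Watson: H0 = no first-order autocorrelation (mainly meaningful if
# the row order carries time structure).
dwtest(multi_reg_model)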
# Example peak scenarios
scenarios <- tibble(
Sales = c(1000, 1500),
Lead_Time_Days = c(5, 7),
Price = c(20, 25)
)
# Predict with 95% prediction interval
pred_tbl <- predict(multi_reg_model, newdata = scenarios, interval = "prediction", level = 0.95)
pred_tbl <- as_tibble(pred_tbl) |> setNames(c("Fit", "Lower", "Upper"))
# Bind predictions and compute upper-bound inventory
scenarios <- scenarios |>
bind_cols(pred_tbl) |>
mutate(Optimized_Inventory = round(Upper, 1))
# View results
scenarios
## # A tibble: 2 × 7
## Sales Lead_Time_Days Price Fit Lower Upper Optimized_Inventory
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1000 5 20 472. 164. 780. 780.
## 2 1500 7 25 477. 167. 787. 788.
Using the multiple regression model, I generated predictions for two example peak-sales scenarios. The 95% prediction interval captures the uncertainty around individual inventory levels, and its upper bound is used as the recommended "optimized" inventory to reduce stock-out risk. This provides a baseline; additional factors like seasonality or product-specific demand should also be considered.
# Summary table
list(
Coefficients = model_coef,
Model_Summary = model_glance,
Diagnostics = list(residual_sd = resid_sd, VIF = vif_vals, shapiro_p = shapiro_p),
Example_Optimized = scenarios
)
## $Coefficients
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 465. 61.3 7.58 1.35e-12
## 2 Sales -0.00781 0.0240 -0.326 7.45e- 1
## 3 Lead_Time_Days 7.32 5.29 1.38 1.68e- 1
## 4 Price -1.09 2.31 -0.472 6.38e- 1
##
## $Model_Summary
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0122 -0.00289 155. 0.809 0.490 3 -1291. 2592. 2608.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
##
## $Diagnostics
## $Diagnostics$residual_sd
## [1] 155.2702
##
## $Diagnostics$VIF
## Sales Lead_Time_Days Price
## 1.017562 1.008633 1.011822
##
## $Diagnostics$shapiro_p
## [1] 0.4239664
##
##
## $Example_Optimized
## # A tibble: 2 × 7
## Sales Lead_Time_Days Price Fit Lower Upper Optimized_Inventory
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1000 5 20 472. 164. 780. 780.
## 2 1500 7 25 477. 167. 787. 788.
The regression explains almost none of the inventory variance and no predictor is significant. Multicollinearity is low, but missing variables likely drive inventory more than Sales, Price, or Lead Time. The prediction intervals from the model give a principled baseline, but final inventory levels should rely on contextual adjustments and business rules.
This analysis explored retail sales, inventory levels, lead times, and pricing to support data-driven decision-making. I fitted theoretical distributions to key variables, showing that Sales are best modeled by a Gamma distribution, Inventory Levels by a Lognormal distribution, and Lead Times by a Normal distribution. Empirical and theoretical statistics matched closely, confirming the appropriateness of these models.
Correlation and independence tests revealed weak relationships between Sales, Price, and Inventory Levels, suggesting that pricing alone does not drive sales, and inventory decisions depend on additional factors like seasonality or demand patterns. Multiple regression for inventory prediction showed low explanatory power, indicating that other variables not captured here are likely more important. Prediction intervals provide a practical baseline for inventory planning under uncertainty, but business context and operational rules remain critical.
Overall, the project demonstrates how statistical modeling and regression analysis can inform retail decisions, identify key drivers of variability, and guide inventory optimization, while highlighting the limitations of relying on a small set of predictors.