Setup

I read the synthetic retail data into a dataframe and performed some basic checks on data types, duplicates, and missing data.

# readr (loaded with the tidyverse) provides read_csv
require(tidyverse)

url <- 'https://raw.githubusercontent.com/alexandersimon1/Data605/refs/heads/main/synthetic_retail_data.csv'
retail_df <- read_csv(url, show_col_types = FALSE)
glimpse(retail_df)
## Rows: 200
## Columns: 6
## $ Product_ID        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Sales             <dbl> 158.43952, 278.99020, 698.85868, 1832.39467, 459.703…
## $ Inventory_Levels  <dbl> 367.4421, 426.6512, 407.6394, 392.3912, 448.3120, 54…
## $ Lead_Time_Days    <dbl> 6.314587, 5.800673, 3.071936, 3.534253, 10.802241, 1…
## $ Price             <dbl> 18.795197, 26.089636, 22.399985, 27.092013, 18.30782…
## $ Seasonality_Index <dbl> 1.1839497, 0.8573051, 0.6986774, 0.6975404, 0.840725…
duplicate_rows <- nrow(retail_df) - nrow(distinct(retail_df))
sprintf("Duplicate rows in dataset: %d", duplicate_rows)
## [1] "Duplicate rows in dataset: 0"
get_na_columns <- function(df) {
  # count NAs in all columns, keeping those with NAs
  na_columns <- map(df, ~ sum(is.na(.))) %>%
    keep(~ .x > 0)
  
  if (length(na_columns) == 0) {
    return("There are no columns with missing values")
  }
  return(na_columns)
}

get_na_columns(retail_df)
## [1] "There are no columns with missing values"

The data are clean (no duplicates or missing values), so I proceed to the exam problems.


Problem 1: Business Risk and Revenue Modeling

Context: You are a data scientist working for a retail chain that models sales, inventory levels, and the impact of pricing and seasonality on revenue. Your task is to analyze various distributions that can describe sales variability and forecast potential revenue.

Part 1: Empirical and Theoretical Analysis of Distributions (5 Points)

Generate and analyze distributions

X ~ Sales: Consider the Sales variable from the dataset. Assume it follows a Gamma distribution and estimate its shape and scale parameters using the fitdistr function from the MASS package.

Answer: shape = 1.835 and scale = 347.1

# https://cran.r-project.org/web/packages/MASS/MASS.pdf
require(MASS)
set.seed(128)

# Note that the 'lower' parameter needs to be defined to prevent
# the optim function from trying negative estimates of the parameters
sales_gamma_fit <- fitdistr(retail_df$Sales, "gamma", lower = c(0, 0))
sales_gamma_fit
##       shape         rate    
##   1.834962588   0.002881012 
##  (0.151175450) (0.000255698)

Calculate the scale

rate <- sales_gamma_fit$estimate[[2]]
scale <- 1 / rate
scale
## [1] 347.1003
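As a quick visual sanity check (a sketch, not part of the assigned tasks; the bin count of 30 is an arbitrary choice), the fitted gamma density can be overlaid on a histogram of Sales. The red curve should roughly track the histogram if the gamma assumption is reasonable.

require(ggplot2)

# Overlay the fitted gamma density (estimated shape and rate)
# on a density histogram of the empirical Sales data
ggplot(retail_df, aes(x = Sales)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = 'lightblue', color = 'black') +
  stat_function(fun = dgamma,
                args = list(shape = sales_gamma_fit$estimate[['shape']],
                            rate = sales_gamma_fit$estimate[['rate']]),
                color = 'red') +
  labs(x = 'Sales ($)', y = 'Density')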

Y ~ Inventory Levels: Assume that the sum of inventory levels across similar products follows a Lognormal distribution. Estimate the parameters for this distribution.

Answer: mean = 6.133, standard deviation = 0.3633 (log scale)

inventory_lognormal_fit <- fitdistr(retail_df$Inventory_Levels, "lognormal")
inventory_lognormal_fit
##     meanlog       sdlog   
##   6.13303680   0.36332727 
##  (0.02569112) (0.01816636)

Z ~ Lead Time: Assume that Lead_Time_Days follows a Normal distribution. Estimate the mean and standard deviation.

Answer: mean = 6.834, standard deviation = 2.083

lead_time_normal_fit <- fitdistr(retail_df$Lead_Time_Days, "normal")
lead_time_normal_fit
##      mean         sd    
##   6.8342981   2.0832137 
##  (0.1473055) (0.1041607)

Calculate empirical expected value and variance

  • Calculate the empirical mean and variance for all three variables.

  • Compare these empirical values with the theoretical values derived from the estimated distribution parameters.

    Answer: In general, the theoretical (predicted) means and variances were close to the empirical (actual) values. The residuals (ie, theoretical minus empirical values) for inventory (log-normal) were a little larger than those for sales (gamma) and lead time (normal). Calculations are shown below, followed by a summary table of the results.

First, I calculated the empirical means and variances.

# Empirical mean and variance
sales_emp_mean <- mean(retail_df$Sales)
sales_emp_var <- var(retail_df$Sales)
inventory_emp_mean <- mean(retail_df$Inventory_Levels)
inventory_emp_var <- var(retail_df$Inventory_Levels)
lead_emp_mean <- mean(retail_df$Lead_Time_Days)
lead_emp_var <- var(retail_df$Lead_Time_Days)

# Create data table
empirical_data = matrix(c(sales_emp_mean, sales_emp_var, 
                          inventory_emp_mean, inventory_emp_var,
                          lead_emp_mean, lead_emp_var), 
                        ncol = 2, byrow = TRUE)
colnames(empirical_data) <- c('Empirical_Mean', 'Empirical_Variance')
rownames(empirical_data) <- c('Sales (gamma)', 
                              'Inventory (lognormal)', 
                              'Lead Time (normal)')

emp_table <- as.table(empirical_data)

Then I calculated the theoretical means and variances.

For a gamma distribution,1

\[ mean = \frac{shape}{rate} \]

and

\[ variance = \frac{shape}{rate^2} \]

shape <- sales_gamma_fit$estimate[[1]]
sales_pred_mean <- shape / rate
sales_pred_mean
## [1] 636.916
sales_pred_var <- shape / (rate^2)
sales_pred_var
## [1] 221073.7

For a log-normal distribution,2

\[ mean = e^{(\mu + \frac{\sigma^2}{2})} \]

and

\[ variance = (e^{\sigma^2} - 1)e^{(2\mu + \sigma^2)} \]

inventory_meanlog <- inventory_lognormal_fit$estimate[[1]]
inventory_sdlog <- inventory_lognormal_fit$estimate[[2]]

inventory_pred_mean <- exp(inventory_meanlog + 0.5*inventory_sdlog^2)
inventory_pred_mean
## [1] 492.2763
inventory_pred_var <- (exp(inventory_sdlog^2) - 1) * 
                       exp(2*inventory_meanlog + inventory_sdlog^2)
inventory_pred_var
## [1] 34197.47

For the normal distribution, the mean was estimated directly in the previous step, so only the variance needs to be calculated, as the square of the estimated standard deviation.

lead_time_pred_mean <- lead_time_normal_fit$estimate[[1]]
lead_time_pred_mean
## [1] 6.834298
lead_time_pred_sd <- lead_time_normal_fit$estimate[[2]]
lead_time_pred_var <- lead_time_pred_sd ^ 2
lead_time_pred_var
## [1] 4.339779
# Create data table for theoretical mean and variance
theoretical_data = matrix(c(sales_pred_mean, sales_pred_var, 
                            inventory_pred_mean, inventory_pred_var,
                            lead_time_pred_mean, lead_time_pred_var), 
                          ncol = 2, byrow = TRUE)
colnames(theoretical_data) <- c('Theoretical_Mean', 'Theoretical_Variance')
rownames(theoretical_data) <- c('Sales (gamma)', 
                              'Inventory (lognormal)', 
                              'Lead Time (normal)')

theor_table <- as.table(theoretical_data)

Summary of results

require(kableExtra)

answers <- cbind(emp_table, theor_table)
answers %>%
  kbl() %>%
  kable_styling()
|                       | Empirical_Mean | Empirical_Variance | Theoretical_Mean | Theoretical_Variance |
|-----------------------|----------------|--------------------|------------------|----------------------|
| Sales (gamma)         | 636.916210     | 2.148318e+05       | 636.915990       | 2.210737e+05         |
| Inventory (lognormal) | 488.547176     | 2.403945e+04       | 492.276304       | 3.419747e+04         |
| Lead Time (normal)    | 6.834298       | 4.361587e+00       | 6.834298         | 4.339779e+00         |

Part 2: Probability Analysis and Independence Testing (5 Points)

Empirical Probabilities: For the Lead_Time_Days variable (assumed to be normally distributed), calculate the following empirical probabilities:

\(P(Z > \mu \ | \ Z > \mu - \sigma)\)

Answer: 0.593

By definition

\[ P(Z > \mu \ | \ Z > \mu - \sigma) = \frac{P(Z > \mu \ \cap \ Z > \mu - \sigma)}{P(Z > \mu - \sigma)} \qquad (1) \]

\(\mu\) (lead_emp_mean) was calculated above. The standard deviation \(\sigma\) is 2.09.

lead_emp_sd <- sd(retail_df$Lead_Time_Days)
lead_emp_sd
## [1] 2.088441

Next, calculate the Z score for each observation, \(Z = \frac{actual\ value\ - \ \mu}{\sigma}\)

require(tidyverse)

retail_df <- retail_df %>%
  mutate(
    Lead_Time_Z = (Lead_Time_Days - lead_emp_mean) / lead_emp_sd,
    .after = Lead_Time_Days
  )

Then calculate the numerator of equation (1)

\[ P(Z > \mu \ \cap \ Z > \mu - \sigma) = \frac{Number\ of\ observations(Z > \mu \ \cap \ Z > \mu - \sigma)}{Total\ observations} \]

require(tidyverse)

# Z > mean
Z_gt0 <- retail_df %>%
  filter(Lead_Time_Z > 0)

# Z > mean - sd
Z_gt_neg1 <- retail_df %>%
  filter(Lead_Time_Z > -1)

# Intersection
intersection_df <- intersect(Z_gt0, Z_gt_neg1)

# Calculate the empirical probability
P_numerator <- nrow(intersection_df) / nrow(retail_df)
P_numerator
## [1] 0.495

Similarly, the denominator of equation (1) is

P_denominator <- nrow(Z_gt_neg1) / nrow(retail_df)
P_denominator
## [1] 0.835

Combining the results, the desired probability is \(P(Z > \mu \ | \ Z > \mu - \sigma) = 0.593\)

P <- P_numerator / P_denominator
P
## [1] 0.5928144

\(P(Z > \mu + \sigma \ | \ Z > \mu)\)

Answer: 0.313

\[ P(Z > \mu + \sigma \ | \ Z > \mu) = \frac{P(Z > \mu + \sigma \ \cap \ Z > \mu)}{P(Z > \mu)} \qquad (2) \]

Calculate the numerator of equation (2)

\[ P(Z > \mu + \sigma \ \cap \ Z > \mu) = \frac{Number\ of\ observations(Z > \mu + \sigma \ \cap \ Z > \mu)}{Total\ observations} \]

require(tidyverse)

# Z > mean + sd
Z_gt1 <- retail_df %>%
  filter(Lead_Time_Z > 1)

# Z > mean was calculated in previous part

# Intersection
intersection_df <- intersect(Z_gt1, Z_gt0)

# Calculate the empirical probability
P_numerator <- nrow(intersection_df) / nrow(retail_df)
P_numerator
## [1] 0.155

Similarly, the denominator of equation (2) is

P_denominator <- nrow(Z_gt0) / nrow(retail_df)
P_denominator
## [1] 0.495

Combining the results, the desired probability is \(P(Z > \mu + \sigma \ | \ Z > \mu) = 0.313\)

P <- P_numerator / P_denominator
P
## [1] 0.3131313

\(P(Z > \mu + 2\sigma \ | \ Z > \mu)\)

Answer: 0.0303

\[ P(Z > \mu + 2\sigma \ | \ Z > \mu) = \frac{P(Z > \mu + 2\sigma \ \cap \ Z > \mu)}{P(Z > \mu)} \qquad (3) \]

Calculate the numerator of equation (3)

\[ P(Z > \mu + 2\sigma \ \cap \ Z > \mu) = \frac{Number\ of\ observations(Z > \mu + 2\sigma \ \cap \ Z > \mu)}{Total\ observations} \]

require(tidyverse)

# Z > mean + 2sd
Z_gt2 <- retail_df %>%
  filter(Lead_Time_Z > 2)

# Z > mean was calculated previously

# Intersection
intersection_df <- intersect(Z_gt2, Z_gt0)

# Calculate the empirical probability
P_numerator <- nrow(intersection_df) / nrow(retail_df)
P_numerator
## [1] 0.015

Similarly, the denominator of equation (3) is

P_denominator <- nrow(Z_gt0) / nrow(retail_df)
P_denominator
## [1] 0.495

Combining the results, the desired probability is \(P(Z > \mu + 2\sigma \ | \ Z > \mu) = 0.0303\)

P <- P_numerator / P_denominator
P
## [1] 0.03030303
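As a cross-check (a sketch reusing the Lead_Time_Z column computed above), all three conditional probabilities can be computed directly with mean(): because \(Z > 0\) implies \(Z > -1\) (and similarly for the other pairs), each intersection reduces to the more restrictive event.

z <- retail_df$Lead_Time_Z

# P(Z > mu | Z > mu - sigma); should reproduce ~0.593
mean(z > 0) / mean(z > -1)
# P(Z > mu + sigma | Z > mu); should reproduce ~0.313
mean(z > 1) / mean(z > 0)
# P(Z > mu + 2*sigma | Z > mu); should reproduce ~0.0303
mean(z > 2) / mean(z > 0)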

Correlation and Independence

Investigate the correlation between Sales and Price. Create a contingency table using quartiles of Sales and Price, and then evaluate the marginal and joint probabilities.

A scatterplot of Sales vs Price does not show an obvious linear relationship.

ggplot(retail_df, aes(x = Price, y = Sales)) +
  geom_point() +
  labs(x = 'Price ($)', y = 'Sales ($)') +
  ggtitle('Price vs sales in synthetic retail dataset')

A simple linear regression model of Sales vs Price has an adjusted \(R^2 = 0.00556\) (multiple \(R^2 = 0.0106\)), meaning that variation in Price explains only about 1% of the variation in Sales. This is very low and indicates that Sales and Price are not linearly correlated.

The F-statistic, which indicates whether the linear regression model is a better fit to the data than a model without any independent variables, corroborates this result. Since \(p = 0.148 > 0.05\), we cannot reject the null hypothesis that a model with no independent variables fits the data as well as the linear regression model.

retail_lm <- lm(Sales ~ Price, data = retail_df)
summary(retail_lm)
## 
## Call:
## lm(formula = Sales ~ Price, data = retail_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -679.54 -347.85  -98.63  241.12 1770.08 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  442.951    137.419   3.223  0.00148 **
## Price          9.916      6.824   1.453  0.14775   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 462.2 on 198 degrees of freedom
## Multiple R-squared:  0.01055,    Adjusted R-squared:  0.005556 
## F-statistic: 2.112 on 1 and 198 DF,  p-value: 0.1478

Quartiles of Price

price_quantiles <- quantile(retail_df$Price, probs = c(0, 0.25, 0.5, 0.75, 1))
price_quantiles
##        0%       25%       50%       75%      100% 
##  5.053035 16.553627 19.976877 22.923960 29.404018

Quartiles of Sales

sales_quantiles <- quantile(retail_df$Sales, probs = c(0, 0.25, 0.5, 0.75, 1))
sales_quantiles
##         0%        25%        50%        75%       100% 
##   25.57477  284.42235  533.54023  867.58104 2447.48868

The task did not specify which quartile boundary to use to construct the contingency table and evaluate the marginal and joint probabilities, so I selected Q1 (the 25th percentile). I then counted the number of observations in each of the four cells (ie, Price at or below/above Q1 \(\times\) Sales at or below/above Q1).

# Row 1
# Sales <= 25% quantile & Price <= 25% quantile
c11 <- nrow(filter(retail_df, Sales <= sales_quantiles[2] & Price <= price_quantiles[2]))
# Sales <= 25% quantile & Price > 25% quantile
c12 <- nrow(filter(retail_df, Sales <= sales_quantiles[2] & Price > price_quantiles[2]))
# Row total
c13 <- c11 + c12

# Row 2
# Sales > 25% quantile & Price <= 25% quantile
c21 <- nrow(filter(retail_df, Sales > sales_quantiles[2] & Price <= price_quantiles[2]))
# Sales > 25% quantile & Price > 25% quantile
c22 <- nrow(filter(retail_df, Sales > sales_quantiles[2] & Price > price_quantiles[2]))
# Row total
c23 <- c21 + c22

# Row 3
# Column 1 total
c31 <- c11 + c21
# Column 2 total
c32 <- c12 + c22
# Column 3 total
c33 <- c13 + c23

Contingency table

contingency_data = matrix(c(c11, c12, c13, 
                            c21, c22, c23,
                            c31, c32, c33), ncol = 3, byrow = TRUE)
colnames(contingency_data) <- c('Price<=Q1', 'Price>Q1', 'Total')
rownames(contingency_data) <- c('Sales<=Q1', 'Sales>Q1', 'Total')

contingency_table <- as.table(contingency_data)
contingency_table
##           Price<=Q1 Price>Q1 Total
## Sales<=Q1        11       39    50
## Sales>Q1         39      111   150
## Total            50      150   200

Marginal probabilities

Using the values in the contingency table,

\[ P(Price\ \le \ Q1) = \frac{50}{200} = 0.25 \]

\[ P(Price\ > \ Q1) = \frac{150}{200} = 0.75 \]

\[ P(Sales\ \le \ Q1) = \frac{50}{200} = 0.25 \]

\[ P(Sales\ > \ Q1) = \frac{150}{200} = 0.75 \]

Joint probabilities

Using the values in the contingency table,

\[ P(Price\ \le \ Q1 \ \cap \ Sales\ \le \ Q1) = \frac{11}{200} = 0.055 \]

\[ P(Price\ > \ Q1 \ \cap \ Sales\ \le \ Q1) = \frac{39}{200} = 0.195 \]

\[ P(Price\ \le \ Q1 \ \cap \ Sales\ > \ Q1) = \frac{39}{200} = 0.195 \]

\[ P(Price\ > \ Q1 \ \cap \ Sales\ > \ Q1) = \frac{111}{200} = 0.555 \]
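The same table and probabilities can be generated programmatically; a sketch using base R (table, addmargins, prop.table):

# Indicator factors at the Q1 cut points
sales_cut <- factor(retail_df$Sales > sales_quantiles[2],
                    levels = c(FALSE, TRUE), labels = c('Sales<=Q1', 'Sales>Q1'))
price_cut <- factor(retail_df$Price > price_quantiles[2],
                    levels = c(FALSE, TRUE), labels = c('Price<=Q1', 'Price>Q1'))

ct <- table(sales_cut, price_cut)
addmargins(ct)           # contingency table with marginal totals
prop.table(ct)           # joint probabilities
rowSums(prop.table(ct))  # marginal probabilities for Sales
colSums(prop.table(ct))  # marginal probabilities for Price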

Use Fisher’s Exact Test and the Chi-Square Test to check for independence between Sales and Price. Discuss which test is most appropriate and why.

Chi-square test

\(p > 0.05\) so we cannot reject the null hypothesis that there is no relationship between Sales and Price.

# Remove marginal totals from contingency table
contingency_table <- contingency_table[1:2, 1:2]
chisq.test(contingency_table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 0.14222, df = 1, p-value = 0.7061

Fisher’s exact test

The p-value was nearly equal to that from the Chi-square test. \(p > 0.05\) so we cannot reject the null hypothesis that there is no relationship between Sales and Price.

fisher.test(contingency_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.7066
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.3372442 1.7985136
## sample estimates:
## odds ratio 
##  0.8036309

Discussion: In this exercise, the Chi-square and Fisher’s exact tests give nearly the same p-value, and both indicate that Sales and Price are not related. Fisher’s exact test is preferred whenever it is computationally feasible because it computes an exact p-value rather than relying on the chi-square approximation (as its name implies). It is especially important when expected cell counts are very small or the margins are grossly unbalanced.3


Problem 2: Advanced Forecasting and Optimization (Calculus) in Retail

Context: You are working for a large retail chain that wants to optimize pricing, inventory management, and sales forecasting using data-driven strategies. Your task is to use regression, statistical modeling, and calculus-based methods to make informed decisions.

Part 1: Descriptive and Inferential Statistics for Inventory Data (5 Points)

Inventory Data Analysis

Generate univariate descriptive statistics for the Inventory_Levels and Sales variables.

summary(retail_df$Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.57  284.42  533.54  636.92  867.58 2447.49
summary(retail_df$Inventory_Levels)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   67.35  376.51  483.72  488.55  600.42  858.79

Create appropriate visualizations such as histograms and scatterplots for Inventory_Levels, Sales, and Price.

I pivoted the variables of interest from wide to long to group plots together.

require(tidyverse)

retail_long_df <- retail_df %>%
  pivot_longer(names_to = 'parameter', values_to = 'value', 
               cols = c('Sales', 'Inventory_Levels', 'Price'))

The histograms show that the distribution of Inventory_Levels is relatively symmetrical (reflecting the approximately equal mean [489] and median [484] in the summary statistics above), whereas Sales is skewed to the right (its mean [637] is well above its median [534]) and Price is roughly symmetric with a slight left tail (its mean [19.56] is just below its median [19.98]).

ggplot(retail_long_df, aes(x = value, fill = parameter)) +
  geom_histogram() +
  facet_wrap(~parameter, ncol = 1, scales = 'free') +
  theme(
    legend.position = 'none'
  )
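To back the visual impression with numbers (a sketch using the moments package, which is also loaded later for the residual analysis), skewness can be computed directly; positive values indicate right skew and negative values left skew.

require(moments)

# Sample skewness of each variable (positive = right skew, negative = left skew)
sapply(retail_df[c('Sales', 'Inventory_Levels', 'Price')], skewness)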

The boxplots below show that Price and Sales have several outliers.

ggplot(retail_long_df, aes(x = parameter, y = value, color = parameter)) +
  geom_boxplot() +
  coord_flip() +
  facet_wrap(~parameter, ncol = 1, scales = 'free') +
  theme(
    legend.position = 'none'
  )

A scatterplot of Sales vs Price does not show an obvious linear relationship.

ggplot(retail_df, aes(x = Price, y = Sales)) +
  geom_point() +
  labs(x = 'Price ($)', y = 'Sales ($)') +
  ggtitle('Price vs sales in synthetic retail dataset')

A scatterplot of Price vs Inventory_Levels likewise shows no obvious linear relationship.

ggplot(retail_df, aes(x = Inventory_Levels, y = Price)) +
  geom_point() +
  labs(x = 'Inventory Level', y = 'Price ($)') +
  ggtitle('Inventory level vs price in synthetic retail dataset')

A scatterplot of Inventory_Levels vs Sales also shows no obvious linear relationship.

ggplot(retail_df, aes(x = Inventory_Levels, y = Sales)) +
  geom_point() +
  labs(x = 'Inventory Level', y = 'Sales ($)') +
  ggtitle('Inventory level vs sales in synthetic retail dataset')

Compute a correlation matrix for Sales, Price, and Inventory_Levels

The correlation matrix below shows that the pairwise correlations between these variables are very low (about 0.1 or less in absolute value).

retail_cor <- cor(retail_df[c('Sales', 'Price', 'Inventory_Levels')])
retail_cor
##                        Sales       Price Inventory_Levels
## Sales             1.00000000  0.10272730      -0.03529619
## Price             0.10272730  1.00000000      -0.04025941
## Inventory_Levels -0.03529619 -0.04025941       1.00000000

The correlation plot below shows this graphically. The light blue and light pink circles off the main diagonal show that the correlation coefficients are close to zero.

require(corrplot)

# diag=FALSE hides the circles on the main diagonal as they are visually distracting
# and not relevant
corrplot(retail_cor, diag = FALSE, outline = TRUE, tl.col = "black", bg = "darkgray")

Test the hypotheses that the correlations between the variables are zero and provide a 95% confidence interval

I assessed pairwise correlations between Sales, Price, and Inventory_Levels using Pearson correlation tests.

For Sales vs Price, the sample estimate of the correlation coefficient is 0.103 with a 95% confidence interval of (-0.0365, 0.2381). Since the confidence interval includes zero and \(p = 0.15 > 0.05\), we cannot reject the null hypothesis that the true correlation is zero.

cor.test(retail_df$Sales, retail_df$Price, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  retail_df$Sales and retail_df$Price
## t = 1.4532, df = 198, p-value = 0.1478
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03653442  0.23807516
## sample estimates:
##       cor 
## 0.1027273

For Sales vs Inventory_Levels, the sample estimate of the correlation coefficient is -0.0353 with a 95% confidence interval of (-0.173, 0.104). Since the confidence interval includes zero and \(p = 0.62 > 0.05\), we cannot reject the null hypothesis that the true correlation is zero.

cor.test(retail_df$Sales, retail_df$Inventory_Levels, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  retail_df$Sales and retail_df$Inventory_Levels
## t = -0.49697, df = 198, p-value = 0.6198
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1731891  0.1039539
## sample estimates:
##         cor 
## -0.03529619

For Price vs Inventory_Levels, the sample estimate of the correlation coefficient is -0.0403 with a 95% confidence interval of (-0.178, 0.099). Since the confidence interval includes zero and \(p = 0.57 > 0.05\), we cannot reject the null hypothesis that the true correlation is zero.

cor.test(retail_df$Price, retail_df$Inventory_Levels, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  retail_df$Price and retail_df$Inventory_Levels
## t = -0.56696, df = 198, p-value = 0.5714
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17800614  0.09903478
## sample estimates:
##         cor 
## -0.04025941

Discussion

Explain the meaning of your findings and discuss the implications of the correlations for inventory management. Would you be concerned about multicollinearity in a potential regression model? Why or why not?

The correlation matrix and Pearson correlation tests show that, in the sample retail data, Sales, Price, and Inventory_Levels are not significantly correlated with each other. In other words, there is no evidence that the true correlations in the population (ie, all retail data that the business generates) differ from zero. If so, inventory cannot be predicted from product price or sales using a linear relationship and would need to be managed using some other strategy (eg, a nonlinear model).

I would not be concerned about multicollinearity in a regression model because none of the variables are highly correlated with each other.
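As a concrete check (a sketch using car::vif; the car package is loaded later for the Box-Cox analysis), the variance inflation factors for a model containing all three predictors should all be close to 1 when multicollinearity is negligible.

require(car)

# VIFs near 1 indicate negligible multicollinearity among the predictors
vif(lm(Inventory_Levels ~ Sales + Lead_Time_Days + Price, data = retail_df))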


Part 2: Linear Algebra and Pricing Strategy (5 Points)

Price Elasticity of Demand

Use linear regression to model the relationship between Sales and Price (assuming Sales as the dependent variable).

I performed this analysis in Problem 1, Part 2, so I only show the regression summary below.

The regression equation is \(Sales = 9.92 \times Price + 442.95\). However, the adjusted \(R^2 = 0.00556\) is very low and the F-statistic's p-value (0.148) is not statistically significant, so the model accounts for only about 1% of the variation in Sales and is no better than a model with no independent variables.

summary(retail_lm)
## 
## Call:
## lm(formula = Sales ~ Price, data = retail_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -679.54 -347.85  -98.63  241.12 1770.08 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  442.951    137.419   3.223  0.00148 **
## Price          9.916      6.824   1.453  0.14775   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 462.2 on 198 degrees of freedom
## Multiple R-squared:  0.01055,    Adjusted R-squared:  0.005556 
## F-statistic: 2.112 on 1 and 198 DF,  p-value: 0.1478

Invert the correlation matrix from your model, and calculate the precision matrix.

# correlation matrix
sales_price_cor <- cor(retail_df[c('Sales', 'Price')])

# invert correlation matrix
precision_matrix <- solve(sales_price_cor)
precision_matrix
##            Sales      Price
## Sales  1.0106655 -0.1038229
## Price -0.1038229  1.0106655

Discuss the implications of the diagonal elements of the precision matrix (which are variance inflation factors).

Answer:

The variance inflation factor (VIF) is a measure of the degree to which the variance of a regression coefficient is inflated due to multicollinearity and is related to the \(R^2\) of a regression model that predicts \(x_j\) using all other predictors \(x_{i \ne j}\).4

\[ VIF(x_j) = \frac{1}{1 - R^2_{x_j|x_i \ (i \ne j)}} \]

In the precision matrix above, \(VIF = 1.011\), which means that the variances of the regression coefficients are only slightly inflated by multicollinearity. The corresponding \(R^2_{x_j|x_i \ (i \ne j)} = 1 - 1/1.011 = 0.0106\) (equal to the squared Sales-Price correlation, \(0.1027^2\)) further indicates that Sales and Price are essentially uncorrelated.
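This relationship can be verified directly from the precision matrix (a sketch): solving the VIF formula for \(R^2\) gives \(R^2 = 1 - 1/VIF\), which should match the squared Sales-Price correlation.

# R^2 implied by the diagonal of the precision matrix
1 - 1 / diag(precision_matrix)

# ...which matches the squared pairwise correlation (~0.0106)
cor(retail_df$Sales, retail_df$Price)^2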

Perform LU decomposition on the correlation matrix and interpret the results in the context of price elasticity.

The LU decomposition of the correlation matrix is shown below.

require(matlib)

matlib::LU(sales_price_cor)
## $P
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
## 
## $L
##           [,1] [,2]
## [1,] 1.0000000    0
## [2,] 0.1027273    1
## 
## $U
##      Sales     Price
## [1,]     1 0.1027273
## [2,]     0 0.9894471

Answer:

According to a Wikipedia article about price elasticity of demand,5 demand can be modeled as a linear function of price, ie \(Demand = Elasticity \times Price\), where the elasticity reflects the change in demand for a product relative to a change in its price.

Assuming that sales is a proxy for demand, the correlation matrix between Sales and Price plays the role of the elasticity matrix. The lower-left element of the lower triangular matrix \(L\) and the upper-right element of the upper triangular matrix \(U\) (both 0.103) indicate that Sales changes at approximately one-tenth the (standardized) rate of change in Price, and vice versa. This suggests that the products are relatively inelastic and that sales (demand) will not change much if prices increase.6
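As a quick consistency check (a sketch), multiplying the \(L\) and \(U\) factors should reproduce the original correlation matrix; the permutation matrix \(P\) is the identity here, so no row reordering is involved.

lu <- matlib::LU(sales_price_cor)

# L %*% U should reproduce the 2x2 correlation matrix
lu$L %*% lu$U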


Part 3: Calculus-Based Probability & Statistics for Sales Forecasting (5 Points)

Sales Forecasting Using Exponential Distribution

Identify a variable in the dataset that is skewed to the right (e.g., Sales or Price) and fit an exponential distribution to this data using the fitdistr function.

Of the suggested variables, the histograms in Problem 2, Part 1 show that Sales has the clearest right skew, while Price is closer to symmetric. I fit the exponential distribution to Price here; as shown below, the poor fit is consistent with Price's roughly symmetric shape.

require(MASS)
set.seed(128)

price_exp_fit <- fitdistr(retail_df$Price, "exponential", lower = 0)
price_exp_fit
##       rate    
##   0.051124252 
##  (0.003615031)

Generate 1,000 samples from the fitted exponential distribution and compare a histogram of these samples with the original data’s histogram.

Because the simulated exponential sample spans a much wider range than the empirical prices, I min-max normalized both datasets before plotting.

require(dplyr)

# Generate a sample of fitted data
rate <- price_exp_fit$estimate[[1]]
price_exp_fitted <- tibble(value = rexp(1000, rate))

# Normalize the values
price_exp_norm <- price_exp_fitted %>%
  mutate(
    normalized = (value - min(value)) / (max(value) - min(value)),
    dataset = 'fitted'
  ) %>%
  dplyr::select(-c(value))

Normalize the empirical prices

require(dplyr)

price_emp_norm <- retail_df %>%
  dplyr::select(value = Price) %>%
  mutate(
    normalized = (value - min(value)) / (max(value) - min(value)),
    dataset = 'empirical',
    .after = value
  ) %>%
  dplyr::select(-c(value))

Combine the empirical and fitted dataframes

price_df <- rbind(price_exp_norm, price_emp_norm)

The density histograms below show that the distribution of the empirical Price data does not match the distribution of the fitted exponential data.

require(ggplot2)

ggplot(price_df, aes(x = normalized, fill = dataset, color = dataset)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
  labs(x = 'Normalized Price', y = 'Density') +
  facet_wrap(~dataset) +
  ggtitle("Histograms of empirical vs fitted exponential data")

Calculate the 5th and 95th percentiles using the cumulative distribution function (CDF) of the exponential distribution.

The percentiles are obtained by inverting the CDF, ie with the quantile function qexp(); pexp() evaluates the CDF itself, so pexp(0.05, rate), for example, would return the probability that a price falls below 0.05 rather than the 5th percentile.

5th percentile

qexp(0.05, rate)
## [1] 1.003307

95th percentile

qexp(0.95, rate)
## [1] 58.59709

These correspond to prices of roughly $1.00 and $58.60, far outside the bulk of the observed prices.

Compute a 95% confidence interval for the original data assuming normality and compare it with the empirical percentiles.

If the data are normally distributed, then 95% will be within \(\pm 2\) standard deviations of the mean or between $9.96 and $29.16.

price_mean <- mean(retail_df$Price)
price_sd <- sd(retail_df$Price)
lower_bound <- price_mean - 2*price_sd
upper_bound <- price_mean + 2*price_sd

sprintf("95%% of the prices would be between $%.2f and $%.2f", lower_bound, upper_bound)
## [1] "95% of the prices would be between $9.96 and $29.16"

The actual 5th and 95th percentiles of Price in the synthetic retail dataset are $11.41 and $27.03, respectively. These are much closer to the normality-based interval ($9.96 to $29.16) than to the fitted exponential's percentiles (roughly $1.00 and $58.60), again indicating that the exponential is a poor model for Price.

quantile(retail_df$Price, c(0.05, 0.95))
##       5%      95% 
## 11.41108 27.03023

Discussion

Discuss how well the exponential distribution models the data and what this implies for forecasting future sales or pricing. Consider whether a different distribution might be more appropriate.

Answer: As shown by the histograms above, an exponential distribution model does not fit the empirical price data very well. This means that predicting future prices using an exponential model would not be very accurate. Below, I show that a normal distribution would be a more appropriate model.

Fit a normal distribution to the data

require(MASS)
set.seed(128)

price_normal_fit <- fitdistr(retail_df$Price, "normal")
price_normal_fit
##       mean          sd    
##   19.5601883    4.7895520 
##  ( 0.3386725) ( 0.2394776)

Generate samples from the fitted normal distribution

require(dplyr)

# Generate a sample of fitted data
mean_estimate <- price_normal_fit$estimate[[1]]
sd_estimate <- price_normal_fit$estimate[[2]]
price_normal_fitted <- tibble(value = rnorm(1000, mean_estimate, sd_estimate))

# Normalize the values
price_normal_norm <- price_normal_fitted %>%
  mutate(
    normalized = (value - min(value)) / (max(value) - min(value)),
    dataset = 'fitted'
  ) %>%
  dplyr::select(-c(value))

# Combine dataframes
price_df2 <- rbind(price_normal_norm, price_emp_norm)

The density histograms show that the distribution of the empirical Price data is a much closer match to the distribution of the fitted normal data.

require(ggplot2)

ggplot(price_df2, aes(x = normalized, fill = dataset, color = dataset)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
  labs(x = 'Normalized Price', y = 'Density') +
  facet_wrap(~dataset) +
  ggtitle("Histograms of empirical vs fitted normal data")


Part 4: Regression Modeling for Inventory Optimization (10 Points)

Multiple Regression Model

  • Build a multiple regression model to predict Inventory_Levels based on Sales, Lead_Time_Days, and Price.
  • Provide a full summary of your model, including coefficients, R-squared value, and residual analysis.

Approach 1

My initial thought was to create a multiple linear regression model with Inventory_Levels as the dependent variable and Sales, Lead_Time_Days, and Price as the independent variables. However, none of the predictor coefficients was statistically significant (all \(p>0.05\)); only the intercept was. In addition, \(R^2 = 0.0122\), meaning that only about 1% of the variation in inventory levels could be explained by the model. Finally, the F-statistic's \(p=0.49 > 0.05\) indicates that the model performs no better than a model with no independent variables.

inventory_lm <- lm(Inventory_Levels ~ Sales + Lead_Time_Days + Price, data = retail_df)
summary(inventory_lm)
## 
## Call:
## lm(formula = Inventory_Levels ~ Sales + Lead_Time_Days + Price, 
##     data = retail_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.54 -118.07   -7.68  111.81  372.56 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    464.792662  61.321852   7.580 1.35e-12 ***
## Sales           -0.007809   0.023955  -0.326    0.745    
## Lead_Time_Days   7.316793   5.293049   1.382    0.168    
## Price           -1.087778   2.305846  -0.472    0.638    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 155.3 on 196 degrees of freedom
## Multiple R-squared:  0.01223,    Adjusted R-squared:  -0.002887 
## F-statistic: 0.8091 on 3 and 196 DF,  p-value: 0.4902

Residual analysis

  • Mean squared error - The MSE of the residuals is 23626.66
mean(inventory_lm$residuals ^ 2)
## [1] 23626.66
  • Normality of residuals - It is a little hard to tell from the histogram whether the residuals are normally distributed (ie, bell-shaped). There may be a slight skew to the right.
residuals <- residuals(inventory_lm)
n_bins <- 30
hist(residuals, breaks = seq(min(residuals), max(residuals), length.out = n_bins + 1), 
     main = "Histogram of Residuals", xlab = "Residuals", ylab = "Count")

Calculating the skewness confirms that there is a slight positive skew.

require(moments)

skewness(inventory_lm$residuals)
## [1] 0.1047673

The Q-Q plot suggests that the residuals are not completely normal, with residuals deviating from the diagonal at quantiles less than -1.

plot(inventory_lm, which = 2)


However, the Shapiro-Wilk test gives \(p > 0.05\), so we cannot reject the null hypothesis that the residuals are normally distributed.

shapiro.test(inventory_lm$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  inventory_lm$residuals
## W = 0.99271, p-value = 0.424
  • Residuals vs fitted plot - The residuals appear to be evenly scattered around \(y=0\), which suggests that there isn’t a nonlinear trend.
plot(inventory_lm, which = 1)

  • Variance of residuals - The scale-location plot shows that standardized residuals are roughly balanced on either side of the red curve, which indicates equal variance of the residuals
plot(inventory_lm, which = 3)

In addition, the Breusch-Pagan test gives \(p > 0.05\), so we cannot reject the null hypothesis of homoscedasticity (ie, constant residual variance), which agrees with the scale-location plot.

require(lmtest)

bptest(inventory_lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  inventory_lm
## BP = 3.6206, df = 3, p-value = 0.3055

Interpretation: The residual analysis suggests that the model's assumptions are satisfied; however, the low \(R^2\) and the F-statistic's p-value (>0.05) indicate that the model accounts for very little of the variation in inventory levels (ie, it is not useful for prediction).


Approach 2

Next, I investigated whether data transformation could improve the model. The residual analysis above (Approach 1) indicated that the residuals are normally distributed and have constant variance, which suggests that the dependent variable does not need to be transformed.

Consistent with this, a Box-Cox power transformation showed that the optimal power of the dependent variable is 0.858 (ie, close to 1).7,8

require(car)

powerTransform(lm(retail_df$Inventory_Levels ~ 1))
## Estimated transformation parameter 
##        Y1 
## 0.8578132

The boxcox() function in the MASS package shows this visually:

require(MASS)

boxcox(lm(retail_df$Inventory_Levels ~ 1))
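For reference, the estimated power can be applied by hand using the standard Box-Cox form \(y^{(\lambda)} = \frac{y^\lambda - 1}{\lambda}\). A minimal sketch, assuming the \(\lambda \approx 0.858\) estimated above:

# Box-Cox transform of the dependent variable with the estimated lambda;
# with lambda this close to 1 the transformation is nearly linear,
# so little improvement is expected
lambda <- 0.8578132
inventory_bc <- (retail_df$Inventory_Levels^lambda - 1) / lambda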

To assess whether the independent variables would benefit from transformation, I tried to perform a Box-Tidwell transformation,9 but I got an error that I couldn’t resolve.

require(car)

boxTidwell(Inventory_Levels ~ Sales + Price + Lead_Time_Days, data = retail_df)
## Error in lm.fit(cbind(1, x1.p, x2), y, ...): NA/NaN/Inf in 'x'

Approach 3

I then tried transforming Inventory_Levels based on the concepts of lead time and inventory turnover.10 I did this by defining a parameter that I call “surplus inventory”, which reflects the inventory level in excess of what is needed to meet demand while awaiting replenishment of products that have been sold. To estimate demand, I deseasonalized Price (dividing by Seasonality_Index) to remove seasonal effects,11 and derived the quantity sold as Sales divided by the deseasonalized price.

These modifications resulted in a model with an adjusted \(R^2 = 0.421\), which means that it explains 42.1% of the variation in Surplus_Inventory.

retail_df <- retail_df %>%
  mutate(
    Price_Deseasonalized = Price / Seasonality_Index,
    Quantity_Sold = Sales / Price_Deseasonalized,
    Surplus_Inventory = Inventory_Levels - Quantity_Sold * Lead_Time_Days,
    .after = Price
  )

The coefficients of all terms were statistically significant and the F-statistic \(P<0.05\).

inventory_SP <- lm(Surplus_Inventory ~ Sales + Lead_Time_Days + Price, data = retail_df)
summary(inventory_SP)
## 
## Call:
## lm(formula = Surplus_Inventory ~ Sales + Lead_Time_Days + Price, 
##     data = retail_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -658.96 -121.89   17.11  128.38  471.27 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    349.94359   72.79568   4.807 3.04e-06 ***
## Sales           -0.31565    0.02844 -11.100  < 2e-16 ***
## Lead_Time_Days -23.02149    6.28342  -3.664  0.00032 ***
## Price           13.89741    2.73729   5.077 8.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 184.3 on 196 degrees of freedom
## Multiple R-squared:  0.4298, Adjusted R-squared:  0.4211 
## F-statistic: 49.25 on 3 and 196 DF,  p-value: < 2.2e-16

The MSE of the residuals is 33295.31

mean(inventory_SP$residuals ^ 2)
## [1] 33295.31

The adjusted \(R^2\) increased to 0.541 by using Quantity_Sold in the regression instead of Sales and Price.

inventory_QS <- lm(Surplus_Inventory ~ Quantity_Sold + Lead_Time_Days, 
                   data = retail_df)
summary(inventory_QS)
## 
## Call:
## lm(formula = Surplus_Inventory ~ Quantity_Sold + Lead_Time_Days, 
##     data = retail_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.56 -124.09  -13.33  113.65  390.03 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    652.1385    43.4906  14.995  < 2e-16 ***
## Quantity_Sold   -6.6337     0.4415 -15.025  < 2e-16 ***
## Lead_Time_Days -24.3457     5.5850  -4.359  2.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 164.1 on 197 degrees of freedom
## Multiple R-squared:  0.5457, Adjusted R-squared:  0.541 
## F-statistic: 118.3 on 2 and 197 DF,  p-value: < 2.2e-16

The MSE of the residuals is 26530.25

mean(inventory_QS$residuals ^ 2)
## [1] 26530.25

The MSE of the model derived from Approach 1 was lower (23626.66), but given that its \(R^2\) was close to 0 (it does little more than predict the mean inventory level), the inventory_QS model appears to be the more useful model despite its higher MSE.

The Akaike information criterion (AIC) comparison below likewise numerically favors the Approach 1 model (lower AIC is preferred), and should be read with the same caveat.

AIC(inventory_lm, inventory_SP, inventory_QS)
##              df      AIC
## inventory_lm  5 2591.602
## inventory_SP  5 2660.210
## inventory_QS  4 2612.784

The ‘QS’ regression model is

\[ Surplus\_Inventory = 652.1385 - 6.6337 \times Quantity\_Sold - 24.3457 \times Lead\_Time\_Days \]

or expanding the left hand side and rearranging,

\[ \begin{aligned} Inventory\_Levels &= 652.1385 - 6.6337 \times Quantity\_Sold - 24.3457 \times Lead\_Time\_Days\ + \\ & \qquad Quantity\_Sold \times Lead\_Time\_Days \end{aligned} \]

where

\(Quantity\_Sold = \frac{Sales \times Seasonality\_Index}{Price}\) (ie, Sales divided by the deseasonalized price)
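Because the offset \(Quantity\_Sold \times Lead\_Time\_Days\) is known for every observation, predictions on the original Inventory_Levels scale can be recovered from the surplus model. A sketch:

# Predicted Inventory_Levels = predicted surplus + known offset
inv_pred <- predict(inventory_QS) +
  retail_df$Quantity_Sold * retail_df$Lead_Time_Days

# MSE on the inventory scale; this equals the surplus-model residual MSE
# because the known offset cancels in the residuals
mean((retail_df$Inventory_Levels - inv_pred)^2)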

 

Interpretation of coefficients

  • 652.1385 is the expected surplus inventory level when all other variables are zero

  • -6.6337 is the expected decrease in surplus inventory for each one-unit increase in Quantity_Sold, holding all other variables constant. This makes sense: inventory falls as products are sold

  • -24.3457 is the expected decrease in surplus inventory for each one-day increase in Lead_Time_Days, holding all other variables constant. This also makes sense: with longer lead times, inventory continues to fall while awaiting a restock.

Residual analysis

  • Normality of residuals - The distribution of residuals is similar to that of the linear model derived in Approach 1, ie close to normal but with a slight right skew.
residuals <- residuals(inventory_QS)
n_bins <- 30
hist(residuals, breaks = seq(min(residuals), max(residuals), length.out = n_bins + 1), 
     main = "Residuals of Inventory_QS Model", xlab = "Residuals", ylab = "Count")

Calculate skewness

skewness(inventory_QS$residuals)
## [1] 0.09605207

The Q-Q plot shows that the residuals lie mostly on the diagonal, although they deviate somewhat at quantiles < -1. Overall, the distribution of the residuals appears to be close to normal.

plot(inventory_QS, which = 2)


The Shapiro-Wilk test gives \(p > 0.05\), so we cannot reject the null hypothesis that the residuals are normally distributed.

shapiro.test(inventory_QS$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  inventory_QS$residuals
## W = 0.99331, p-value = 0.5016
  • Residuals vs fitted plot - The residuals are evenly scattered around \(y=0\), although they tend to be clustered on the right (higher fitted values)
plot(inventory_QS, which = 1)

  • Variance of residuals - The scale-location plot shows that standardized residuals are roughly balanced on either side of the red curve, which indicates equal variance of the residuals.
plot(inventory_QS, which = 3)

Consistent with this result, the Breusch-Pagan test gives \(p > 0.05\), so we cannot reject the null hypothesis of homoscedasticity (ie, the residuals have equal variance).

require(lmtest)

bptest(inventory_QS)
## 
##  studentized Breusch-Pagan test
## 
## data:  inventory_QS
## BP = 0.28694, df = 2, p-value = 0.8663
  • Influential points - The Cook’s distance plot shows that all observations have small Cook’s distances, which indicates the absence of influential data points
plot(inventory_QS, which = 4)

 

Overall interpretation

Although the histogram of the residuals shows a little skew, the Q-Q plot and Shapiro-Wilk test suggest that the residuals of the ‘QS’ model are normally distributed. In addition, the scale-location plot and Breusch-Pagan test show that the residuals are homoscedastic. Therefore, I conclude that this model is appropriate for modeling inventory levels in the synthetic retail dataset; it accounts for approximately half of the variation in surplus inventory.