I read the synthetic retail data into a dataframe and performed some basic checks on data types, duplicates, and missing data.
url <- 'https://raw.githubusercontent.com/alexandersimon1/Data605/refs/heads/main/synthetic_retail_data.csv'
retail_df <- read_csv(url, show_col_types = FALSE)
glimpse(retail_df)
## Rows: 200
## Columns: 6
## $ Product_ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Sales <dbl> 158.43952, 278.99020, 698.85868, 1832.39467, 459.703…
## $ Inventory_Levels <dbl> 367.4421, 426.6512, 407.6394, 392.3912, 448.3120, 54…
## $ Lead_Time_Days <dbl> 6.314587, 5.800673, 3.071936, 3.534253, 10.802241, 1…
## $ Price <dbl> 18.795197, 26.089636, 22.399985, 27.092013, 18.30782…
## $ Seasonality_Index <dbl> 1.1839497, 0.8573051, 0.6986774, 0.6975404, 0.840725…
duplicate_rows <- nrow(retail_df) - nrow(distinct(retail_df))
sprintf("Duplicate rows in dataset: %d", duplicate_rows)
## [1] "Duplicate rows in dataset: 0"
get_na_columns <- function(df) {
  # count NAs in all columns, keeping only those columns with NAs
  na_columns <- map(df, ~ sum(is.na(.))) %>%
    keep(~ .x > 0)
  if (length(na_columns) == 0) {
    return("There are no columns with missing values")
  }
  return(na_columns)
}
get_na_columns(retail_df)
## [1] "There are no columns with missing values"
The data are fine, so now I proceed with the exam problems.
Context: You are a data scientist working for a retail chain that models sales, inventory levels, and the impact of pricing and seasonality on revenue. Your task is to analyze various distributions that can describe sales variability and forecast potential revenue.
Select the Sales variable from the dataset. Assume it follows a Gamma distribution and estimate its shape and scale parameters using the fitdistr function from the MASS package.
Answer: shape = 1.835 and scale = 347.1
# https://cran.r-project.org/web/packages/MASS/MASS.pdf
require(MASS)
set.seed(128)
# Note that the 'lower' parameter needs to be defined to prevent
# the optim function from trying negative estimates of the parameters
sales_gamma_fit <- fitdistr(retail_df$Sales, "gamma", lower = c(0, 0))
sales_gamma_fit
## shape rate
## 1.834962588 0.002881012
## (0.151175450) (0.000255698)
Calculate the scale parameter, which is the reciprocal of the fitted rate:
rate <- sales_gamma_fit$estimate[[2]]
scale <- 1 / rate
scale
## [1] 347.1003
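As a quick visual check of the fit, the fitted gamma density can be overlaid on a density histogram of the empirical data (a sketch using the fit object above):
# Overlay the fitted gamma density on a density histogram of Sales
hist(retail_df$Sales, freq = FALSE, breaks = 30,
     xlab = 'Sales', main = 'Gamma fit to Sales')
curve(dgamma(x, shape = sales_gamma_fit$estimate[['shape']],
             rate = sales_gamma_fit$estimate[['rate']]),
      add = TRUE, lwd = 2, col = 'red')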
Inventory_Levels follows a Log-Normal distribution. Estimate the mean and standard deviation on the log scale.
Answer: mean = 6.133, standard deviation = 0.3633 (log scale)
inventory_lognormal_fit <- fitdistr(retail_df$Inventory_Levels, "lognormal")
inventory_lognormal_fit
## meanlog sdlog
## 6.13303680 0.36332727
## (0.02569112) (0.01816636)
Lead_Time_Days follows a Normal distribution. Estimate the mean and standard deviation.
Answer: mean = 6.834, standard deviation = 2.083
lead_time_normal_fit <- fitdistr(retail_df$Lead_Time_Days, "normal")
lead_time_normal_fit
## mean sd
## 6.8342981 2.0832137
## (0.1473055) (0.1041607)
Calculate the empirical mean and variance for all three variables.
Compare these empirical values with the theoretical values derived from the estimated distribution parameters.
Answer: In general, the theoretical (predicted) mean and variance were close to the empirical (actual) values. The residuals (i.e., theoretical minus empirical values) for inventory (log-normal) were a little larger than those for sales (gamma) and lead time (normal). Calculations are shown below, followed by a summary table of the results.
First, I calculated the empirical means and variances.
# Empirical mean and variance
sales_emp_mean <- mean(retail_df$Sales)
sales_emp_var <- var(retail_df$Sales)
inventory_emp_mean <- mean(retail_df$Inventory_Levels)
inventory_emp_var <- var(retail_df$Inventory_Levels)
lead_emp_mean <- mean(retail_df$Lead_Time_Days)
lead_emp_var <- var(retail_df$Lead_Time_Days)
# Create data table
empirical_data = matrix(c(sales_emp_mean, sales_emp_var,
inventory_emp_mean, inventory_emp_var,
lead_emp_mean, lead_emp_var),
ncol = 2, byrow = TRUE)
colnames(empirical_data) <- c('Empirical_Mean', 'Empirical_Variance')
rownames(empirical_data) <- c('Sales (gamma)',
'Inventory (lognormal)',
'Lead Time (normal)')
emp_table <- as.table(empirical_data)
Then I calculated the theoretical means and variances.
For a gamma distribution,1
\[ mean = \frac{shape}{rate} \]
and
\[ variance = \frac{shape}{rate^2} \]
shape <- sales_gamma_fit$estimate[[1]]
sales_pred_mean <- shape / rate
sales_pred_mean
## [1] 636.916
sales_pred_var <- shape / (rate^2)
sales_pred_var
## [1] 221073.7
For a log-normal distribution,2
\[ mean = e^{(\mu + \frac{\sigma^2}{2})} \]
and
\[ variance = (e^{\sigma^2} - 1)e^{(2\mu + \sigma^2)} \]
inventory_meanlog <- inventory_lognormal_fit$estimate[[1]]
inventory_sdlog <- inventory_lognormal_fit$estimate[[2]]
inventory_pred_mean <- exp(inventory_meanlog + 0.5*inventory_sdlog^2)
inventory_pred_mean
## [1] 492.2763
inventory_pred_var <- (exp(inventory_sdlog^2) - 1) *
exp(2*inventory_meanlog + inventory_sdlog^2)
inventory_pred_var
## [1] 34197.47
For the normal distribution, the mean was already estimated in part 1, so only the variance needs to be calculated from the standard deviation:
lead_time_pred_mean <- lead_time_normal_fit$estimate[[1]]
lead_time_pred_mean
## [1] 6.834298
lead_time_pred_sd <- lead_time_normal_fit$estimate[[2]]
lead_time_pred_var <- lead_time_pred_sd ^ 2
lead_time_pred_var
## [1] 4.339779
# Create data table for theoretical mean and variance
theoretical_data = matrix(c(sales_pred_mean, sales_pred_var,
inventory_pred_mean, inventory_pred_var,
lead_time_pred_mean, lead_time_pred_var),
ncol = 2, byrow = TRUE)
colnames(theoretical_data) <- c('Theoretical_Mean', 'Theoretical_Variance')
rownames(theoretical_data) <- c('Sales (gamma)',
'Inventory (lognormal)',
'Lead Time (normal)')
theor_table <- as.table(theoretical_data)
Summary of results
require(kableExtra)
answers <- cbind(emp_table, theor_table)
answers %>%
kbl() %>%
kable_styling()
|                       | Empirical_Mean | Empirical_Variance | Theoretical_Mean | Theoretical_Variance |
|---|---|---|---|---|
| Sales (gamma) | 636.916210 | 2.148318e+05 | 636.915990 | 2.210737e+05 |
| Inventory (lognormal) | 488.547176 | 2.403945e+04 | 492.276304 | 3.419747e+04 |
| Lead Time (normal) | 6.834298 | 4.361587e+00 | 6.834298 | 4.339779e+00 |
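The residuals mentioned in the answer can be computed directly from the two matrices created above (a minimal sketch):
# Theoretical minus empirical means and variances
round(theoretical_data - empirical_data, 2)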
Using the Lead_Time_Days variable (assumed to be normally distributed), calculate the following empirical probabilities. The first is the probability that lead time exceeds the mean given that it exceeds the mean minus one standard deviation, \(P(Z > \mu \mid Z > \mu - \sigma)\).
Answer: 0.593
By definition,
\[ P(Z > \mu \ | \ Z > \mu - \sigma) = \frac{P(Z > \mu \ \cap \ Z > \mu - \sigma)}{P(Z > \mu - \sigma)} \qquad (1) \]
The mean \(\mu\) (lead_emp_mean) was calculated above. The standard deviation \(\sigma\) is 2.088:
lead_emp_sd <- sd(retail_df$Lead_Time_Days)
lead_emp_sd
## [1] 2.088441
Next, calculate the Z score for each observation, \(Z = \frac{actual\ value\ - \ \mu}{\sigma}\)
require(tidyverse)
retail_df <- retail_df %>%
mutate(
Lead_Time_Z = (Lead_Time_Days - lead_emp_mean) / lead_emp_sd,
.after = Lead_Time_Days
)
Then calculate the numerator of equation (1)
\[ P(Z > \mu \ \cap \ Z > \mu - \sigma) = \frac{Number\ of\ observations(Z > \mu \ \cap \ Z > \mu - \sigma)}{Total\ observations} \]
require(tidyverse)
# Z > mean
Z_gt0 <- retail_df %>%
filter(Lead_Time_Z > 0)
# Z > mean - sd
Z_gt_neg1 <- retail_df %>%
filter(Lead_Time_Z > -1)
# Intersection (note: Z > 0 implies Z > -1, so this equals Z_gt0)
intersection_df <- intersect(Z_gt0, Z_gt_neg1)
# Calculate the empirical probability
P_numerator <- nrow(intersection_df) / nrow(retail_df)
P_numerator
## [1] 0.495
Similarly, the denominator of equation (1) is
P_denominator <- nrow(Z_gt_neg1) / nrow(retail_df)
P_denominator
## [1] 0.835
Combining the results, the desired probability is \(P(Z > \mu \ | \ Z > \mu - \sigma) = 0.593\)
P <- P_numerator / P_denominator
P
## [1] 0.5928144
The second is the probability that lead time exceeds the mean plus one standard deviation given that it exceeds the mean, \(P(Z > \mu + \sigma \mid Z > \mu)\).
Answer: 0.313
By definition,
\[ P(Z > \mu + \sigma \ | \ Z > \mu) = \frac{P(Z > \mu + \sigma \ \cap \ Z > \mu)}{P(Z > \mu)} \qquad (2) \]
Calculate the numerator of equation (2)
\[ P(Z > \mu + \sigma \ \cap \ Z > \mu) = \frac{Number\ of\ observations(Z > \mu + \sigma \ \cap \ Z > \mu)}{Total\ observations} \]
require(tidyverse)
# Z > mean + sd
Z_gt1 <- retail_df %>%
filter(Lead_Time_Z > 1)
# Z > mean was calculated in previous part
# Intersection (note: Z > 1 implies Z > 0, so this equals Z_gt1)
intersection_df <- intersect(Z_gt1, Z_gt0)
# Calculate the empirical probability
P_numerator <- nrow(intersection_df) / nrow(retail_df)
P_numerator
## [1] 0.155
Similarly, the denominator of equation (2) is
P_denominator <- nrow(Z_gt0) / nrow(retail_df)
P_denominator
## [1] 0.495
Combining the results, the desired probability is \(P(Z > \mu + \sigma \ | \ Z > \mu) = 0.313\)
P <- P_numerator / P_denominator
P
## [1] 0.3131313
The third is the probability that lead time exceeds the mean plus two standard deviations given that it exceeds the mean, \(P(Z > \mu + 2\sigma \mid Z > \mu)\).
Answer: 0.0303
By definition,
\[ P(Z > \mu + 2\sigma \ | \ Z > \mu) = \frac{P(Z > \mu + 2\sigma \ \cap \ Z > \mu)}{P(Z > \mu)} \qquad (3) \]
Calculate the numerator of equation (3)
\[ P(Z > \mu + 2\sigma \ \cap \ Z > \mu) = \frac{Number\ of\ observations(Z > \mu + 2\sigma \ \cap \ Z > \mu)}{Total\ observations} \]
require(tidyverse)
# Z > mean + 2sd
Z_gt2 <- retail_df %>%
filter(Lead_Time_Z > 2)
# Z > mean was calculated previously
# Intersection (note: Z > 2 implies Z > 0, so this equals Z_gt2)
intersection_df <- intersect(Z_gt2, Z_gt0)
# Calculate the empirical probability
P_numerator <- nrow(intersection_df) / nrow(retail_df)
P_numerator
## [1] 0.015
Similarly, the denominator of equation (3) is
P_denominator <- nrow(Z_gt0) / nrow(retail_df)
P_denominator
## [1] 0.495
Combining the results, the desired probability is \(P(Z > \mu + 2\sigma \ | \ Z > \mu) = 0.0303\)
P <- P_numerator / P_denominator
P
## [1] 0.03030303
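For comparison, the corresponding theoretical probabilities under a standard normal distribution can be computed with pnorm (a sketch; these are not the empirical answers requested by the task):
# Theoretical conditional probabilities for a standard normal Z
p1 <- pnorm(0, lower.tail = FALSE) / pnorm(-1, lower.tail = FALSE) # P(Z > mu | Z > mu - sigma)
p2 <- pnorm(1, lower.tail = FALSE) / pnorm(0, lower.tail = FALSE)  # P(Z > mu + sigma | Z > mu)
p3 <- pnorm(2, lower.tail = FALSE) / pnorm(0, lower.tail = FALSE)  # P(Z > mu + 2*sigma | Z > mu)
round(c(p1, p2, p3), 4)
These evaluate to approximately 0.594, 0.317, and 0.046, close to the empirical 0.593, 0.313, and 0.030; the largest relative gap is at two standard deviations, where only 3 of the 200 observations fall.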
Examine the relationship between Sales and Price. Create a contingency table using quartiles of Sales and Price, and then evaluate the marginal and joint probabilities.
A scatterplot of Sales vs Price does not show an obvious linear relationship.
ggplot(retail_df, aes(x = Price, y = Sales)) +
geom_point() +
labs(x = 'Price ($)', y = 'Sales ($)') +
ggtitle('Price vs sales in synthetic retail dataset')
A simple linear regression model of Sales vs Price shows that the adjusted \(R^2 = 0.00556\), meaning that only 0.556% of the variation in Sales is explained by variation in Price. This is very low and indicates that Sales and Price are not linearly correlated.
The F-statistic, which indicates whether the linear regression model is a better fit to the data than a model without any independent variables, corroborates this result. Since \(p = 0.148 > 0.05\), we cannot reject the null hypothesis that a model with no independent variables fits the data as well as the linear regression model.
retail_lm <- lm(Sales ~ Price, data = retail_df)
summary(retail_lm)
##
## Call:
## lm(formula = Sales ~ Price, data = retail_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -679.54 -347.85 -98.63 241.12 1770.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 442.951 137.419 3.223 0.00148 **
## Price 9.916 6.824 1.453 0.14775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 462.2 on 198 degrees of freedom
## Multiple R-squared: 0.01055, Adjusted R-squared: 0.005556
## F-statistic: 2.112 on 1 and 198 DF, p-value: 0.1478
Quartiles of Price
price_quantiles <- quantile(retail_df$Price, probs = c(0, 0.25, 0.5, 0.75, 1))
price_quantiles
## 0% 25% 50% 75% 100%
## 5.053035 16.553627 19.976877 22.923960 29.404018
Quartiles of Sales
sales_quantiles <- quantile(retail_df$Sales, probs = c(0, 0.25, 0.5, 0.75, 1))
sales_quantiles
## 0% 25% 50% 75% 100%
## 25.57477 284.42235 533.54023 867.58104 2447.48868
The task did not specify which quantile to use to construct the contingency table and evaluate the marginal and joint probabilities, so I selected Q1 (the 25th percentile). I then counted the number of observations in each of the four cells (i.e., Price at or below/above Q1 crossed with Sales at or below/above Q1).
# Row 1
# Sales <= 25% quantile & Price <= 25% quantile
c11 <- nrow(filter(retail_df, Sales <= sales_quantiles[2] & Price <= price_quantiles[2]))
# Sales <= 25% quantile & Price > 25% quantile
c12 <- nrow(filter(retail_df, Sales <= sales_quantiles[2] & Price > price_quantiles[2]))
# Row total
c13 <- c11 + c12
# Row 2
# Sales > 25% quantile & Price <= 25% quantile
c21 <- nrow(filter(retail_df, Sales > sales_quantiles[2] & Price <= price_quantiles[2]))
# Sales > 25% quantile & Price > 25% quantile
c22 <- nrow(filter(retail_df, Sales > sales_quantiles[2] & Price > price_quantiles[2]))
# Row total
c23 <- c21 + c22
# Row 3
# Column 1 total
c31 <- c11 + c21
# Column 2 total
c32 <- c12 + c22
# Column 3 total
c33 <- c13 + c23
Contingency table
contingency_data = matrix(c(c11, c12, c13,
c21, c22, c23,
c31, c32, c33), ncol = 3, byrow = TRUE)
colnames(contingency_data) <- c('Price<=Q1', 'Price>Q1', 'Total')
rownames(contingency_data) <- c('Sales<=Q1', 'Sales>Q1', 'Total')
contingency_table <- as.table(contingency_data)
contingency_table
## Price<=Q1 Price>Q1 Total
## Sales<=Q1 11 39 50
## Sales>Q1 39 111 150
## Total 50 150 200
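The same table can also be built more idiomatically with cut() and addmargins() (a sketch reusing the quantile vectors computed above):
# Dichotomize each variable at Q1 and tabulate with marginal totals
sales_q1 <- cut(retail_df$Sales, breaks = c(-Inf, sales_quantiles[2], Inf),
                labels = c('Sales<=Q1', 'Sales>Q1'))
price_q1 <- cut(retail_df$Price, breaks = c(-Inf, price_quantiles[2], Inf),
                labels = c('Price<=Q1', 'Price>Q1'))
addmargins(table(sales_q1, price_q1))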
Marginal probabilities
Using the values in the contingency table,
\[ P(Price\ \le \ Q1) = \frac{50}{200} = 0.25 \]
\[ P(Price\ > \ Q1) = \frac{150}{200} = 0.75 \]
\[ P(Sales\ \le \ Q1) = \frac{50}{200} = 0.25 \]
\[ P(Sales\ > \ Q1) = \frac{150}{200} = 0.75 \]
Joint probabilities
Using the values in the contingency table,
\[ P(Price\ \le \ Q1 \ \cap \ Sales\ \le \ Q1) = \frac{11}{200} = 0.055 \]
\[ P(Price\ > \ Q1 \ \cap \ Sales\ \le \ Q1) = \frac{39}{200} = 0.195 \]
\[ P(Price\ \le \ Q1 \ \cap \ Sales\ > \ Q1) = \frac{39}{200} = 0.195 \]
\[ P(Price\ > \ Q1 \ \cap \ Sales\ > \ Q1) = \frac{111}{200} = 0.555 \]
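With the factors from the previous sketch, the joint and marginal probabilities can be obtained in one step (dividing the whole table by n = 200):
# Joint probabilities with marginal totals
round(addmargins(prop.table(table(sales_q1, price_q1))), 3)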
Perform a statistical test for independence between Sales and Price. Discuss which test is most appropriate and why.
Chi-square test
Since \(p = 0.706 > 0.05\), we cannot reject the null hypothesis that there is no relationship between Sales and Price.
# Remove marginal totals from contingency table
contingency_table <- contingency_table[1:2, 1:2]
chisq.test(contingency_table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table
## X-squared = 0.14222, df = 1, p-value = 0.7061
Fisher’s exact test
The p-value is nearly identical to that from the Chi-square test. Since \(p > 0.05\), we cannot reject the null hypothesis that there is no relationship between Sales and Price.
fisher.test(contingency_table)
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value = 0.7066
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.3372442 1.7985136
## sample estimates:
## odds ratio
## 0.8036309
Discussion: In this exercise, the Chi-square and Fisher's exact tests give nearly the same p-value, and both indicate that Sales and Price are not related. Fisher's exact test is preferred whenever it is computationally feasible because it computes an exact p-value rather than a large-sample approximation (as its name implies). It is particularly needed when expected counts are very small or the marginal totals are grossly unbalanced.3
Context: You are working for a large retail chain that wants to optimize pricing, inventory management, and sales forecasting using data-driven strategies. Your task is to use regression, statistical modeling, and calculus-based methods to make informed decisions.
Generate summary statistics for the Inventory_Levels and Sales variables.
summary(retail_df$Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.57 284.42 533.54 636.92 867.58 2447.49
summary(retail_df$Inventory_Levels)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.35 376.51 483.72 488.55 600.42 858.79
Create visualizations for Inventory_Levels, Sales, and Price.
I pivoted the variables of interest from wide to long format to group the plots together.
require(tidyverse)
retail_long_df <- retail_df %>%
pivot_longer(names_to = 'parameter', values_to = 'value',
cols = c('Sales', 'Inventory_Levels', 'Price'))
The histograms show that the distribution of Inventory_Levels is relatively symmetrical (reflecting the approximately equal mean [489] and median [484] in the summary statistics above), whereas Sales is clearly skewed to the right (mean 637 > median 534) and Price shows only mild skew.
ggplot(retail_long_df, aes(x = value, fill = parameter)) +
geom_histogram() +
facet_wrap(~parameter, ncol = 1, scales = 'free') +
theme(
legend.position = 'none'
)
The boxplots below show that Price and
Sales have several outliers.
ggplot(retail_long_df, aes(x = parameter, y = value, color = parameter)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~parameter, ncol = 1, scales = 'free') +
theme(
legend.position = 'none'
)
A scatterplot of Sales vs Price does not
appear to have an obvious linear correlation.
ggplot(retail_df, aes(x = Price, y = Sales)) +
geom_point() +
labs(x = 'Price ($)', y = 'Sales ($)') +
ggtitle('Price vs sales in synthetic retail dataset')
A scatterplot of Price vs Inventory_Levels
does not appear to have an obvious linear correlation.
ggplot(retail_df, aes(x = Inventory_Levels, y = Price)) +
geom_point() +
labs(x = 'Inventory Level', y = 'Price ($)') +
ggtitle('Inventory level vs price in synthetic retail dataset')
A scatterplot of Inventory_Levels vs Sales
does not appear to have an obvious linear correlation.
ggplot(retail_df, aes(x = Inventory_Levels, y = Sales)) +
geom_point() +
labs(x = 'Inventory Level', y = 'Sales ($)') +
ggtitle('Inventory level vs sales in synthetic retail dataset')
Compute the correlation matrix for Sales, Price, and Inventory_Levels.
The correlation matrix below shows that the pairwise correlations between these variables are very low (all |r| ≤ 0.103).
retail_cor <- cor(retail_df[c('Sales', 'Price', 'Inventory_Levels')])
retail_cor
## Sales Price Inventory_Levels
## Sales 1.00000000 0.10272730 -0.03529619
## Price 0.10272730 1.00000000 -0.04025941
## Inventory_Levels -0.03529619 -0.04025941 1.00000000
The correlation plot below shows this graphically. The light blue and light pink circles off the main diagonal show that the correlation coefficients are close to zero.
require(corrplot)
# diag=FALSE hides the circles on the main diagonal as they are visually distracting
# and not relevant
corrplot(retail_cor, diag = FALSE, outline = TRUE, tl.col = "black", bg = "darkgray")
I assessed pairwise correlations between Sales,
Price, and Inventory_Levels using Pearson
correlation tests.
For Sales vs Price, the sample estimate of the correlation coefficient is 0.103 with a 95% confidence interval of (-0.0365, 0.2381). Since the confidence interval includes zero and \(p = 0.15 > 0.05\), we cannot reject the null hypothesis that the true correlation is zero.
cor.test(retail_df$Sales, retail_df$Price, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: retail_df$Sales and retail_df$Price
## t = 1.4532, df = 198, p-value = 0.1478
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03653442 0.23807516
## sample estimates:
## cor
## 0.1027273
For Sales vs Inventory_Levels, the sample
estimate of the correlation coefficient is -0.0353 with a 95% confidence
interval of (-0.173, 0.104). Since the confidence interval includes zero
and \(p = 0.6 > 0.05\), we cannot
reject the null hypothesis that the true correlation is zero.
cor.test(retail_df$Sales, retail_df$Inventory_Levels, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: retail_df$Sales and retail_df$Inventory_Levels
## t = -0.49697, df = 198, p-value = 0.6198
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1731891 0.1039539
## sample estimates:
## cor
## -0.03529619
For Price vs Inventory_Levels, the sample
estimate of the correlation coefficient is -0.0403 with a 95% confidence
interval of (-0.178, 0.099). Since the confidence interval includes zero
and \(p = 0.6 > 0.05\), we cannot
reject the null hypothesis that the true correlation is zero.
cor.test(retail_df$Price, retail_df$Inventory_Levels, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: retail_df$Price and retail_df$Inventory_Levels
## t = -0.56696, df = 198, p-value = 0.5714
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17800614 0.09903478
## sample estimates:
## cor
## -0.04025941
The correlation matrix and Pearson correlation tests show that, based on the sample retail data, Sales, Price, and Inventory_Levels are not significantly correlated with each other. This suggests that, in the population (i.e., all retail data that the business generates), the true correlations between these variables may be zero. If so, inventory cannot be predicted from product price or sales using a linear relationship and would need to be managed using some other strategy (e.g., a nonlinear model).
I would not be concerned about multicollinearity in a regression model because none of the variables are highly correlated with each other.
Fit a linear regression model between Sales and Price (with Sales as the dependent variable).
I performed this analysis in Problem 1, Task 2, so I only show the regression summary below. The regression equation is \(Sales = 9.92 \times Price + 442.95\). However, the adjusted \(R^2 = 0.00556\) is very low and the F-statistic p-value is not statistically significant, so the model accounts for less than 1% of the variation in Sales and is no better than a model with no independent variables.
summary(retail_lm)
##
## Call:
## lm(formula = Sales ~ Price, data = retail_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -679.54 -347.85 -98.63 241.12 1770.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 442.951 137.419 3.223 0.00148 **
## Price 9.916 6.824 1.453 0.14775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 462.2 on 198 degrees of freedom
## Multiple R-squared: 0.01055, Adjusted R-squared: 0.005556
## F-statistic: 2.112 on 1 and 198 DF, p-value: 0.1478
Next, I inverted the correlation matrix of Sales and Price to obtain the precision matrix.
# correlation matrix
sales_price_cor <- cor(retail_df[c('Sales', 'Price')])
# invert correlation matrix
precision_matrix <- solve(sales_price_cor)
precision_matrix
## Sales Price
## Sales 1.0106655 -0.1038229
## Price -0.1038229 1.0106655
Answer:
The variance inflation factor (VIF) is a measure of the degree to which the variance of a regression coefficient is inflated due to multicollinearity and is related to the \(R^2\) of a regression model that predicts \(x_j\) using all other predictors \(x_{i \ne j}\).4
\[ VIF(x_j) = \frac{1}{1 - R^2_{x_j|x_i \ (i \ne j)}} \]
In the precision matrix above, the diagonal entries give \(VIF = 1.011\), which means that the variances of the regression coefficients are only slightly inflated by multicollinearity. The corresponding \(R^2_{x_j|x_i \ (i \ne j)} = 1 - 1/1.0107 \approx 0.0106\), which matches the multiple \(R^2\) of the Sales-on-Price regression above and further indicates that Sales and Price are essentially uncorrelated.
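A minimal sketch of this calculation using the precision matrix computed above:
# VIF is the diagonal entry of the precision matrix; back out the corresponding R^2
vif_sales <- precision_matrix['Sales', 'Sales']
c(VIF = vif_sales, R_squared = 1 - 1 / vif_sales)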
The LU decomposition of the correlation matrix is shown below.
require(matlib)
matlib::LU(sales_price_cor)
## $P
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
## $L
## [,1] [,2]
## [1,] 1.0000000 0
## [2,] 0.1027273 1
##
## $U
## Sales Price
## [1,] 1 0.1027273
## [2,] 0 0.9894471
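A quick sanity check that the factorization reproduces the original matrix (here the permutation matrix \(P\) is the identity):
# L %*% U should equal the correlation matrix
lu <- matlib::LU(sales_price_cor)
lu$L %*% lu$U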
Answer:
According to a Wikipedia article about price elasticity of demand,5 demand can be modeled as a linear function of price, i.e., \(Demand = Elasticity \times Price\), where elasticity reflects the change in demand for a product relative to the change in price.
Assuming that sales is a function of demand, the elasticity matrix here is the correlation matrix between Sales and Price. The lower-left element of the lower triangular matrix \(L\) and the upper-right element of the upper triangular matrix \(U\) (0.103) indicate that Sales changes at approximately one-tenth the rate of change in Price (and vice versa). This suggests that the products are relatively inelastic and that sales (demand) will not change much if prices increase.6
Select a skewed variable (Sales or Price) and fit an exponential distribution to this data using the fitdistr function.
Price shows some skew in its histogram (Problem 2, part 1), so I selected it for the exponential fit.
require(MASS)
set.seed(128)
price_exp_fit <- fitdistr(retail_df$Price, "exponential", lower = 0)
price_exp_fit
## rate
## 0.051124252
## (0.003615031)
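For an exponential distribution, the maximum-likelihood estimate of the rate is simply the reciprocal of the sample mean, which provides a quick closed-form check on the fitted value:
# Closed-form ML estimate of the exponential rate; should match the fitted rate above
1 / mean(retail_df$Price)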
Because the empirical and fitted data are on different scales, I normalized the data.
require(dplyr)
# Generate a sample of fitted data
rate <- price_exp_fit$estimate[[1]]
price_exp_fitted <- tibble(value = rexp(1000, rate))
# Normalize the values
price_exp_norm <- price_exp_fitted %>%
mutate(
normalized = (value - min(value)) / (max(value) - min(value)),
dataset = 'fitted'
) %>%
dplyr::select(-c(value))
Normalize the empirical prices
require(dplyr)
price_emp_norm <- retail_df %>%
dplyr::select(value = Price) %>%
mutate(
normalized = (value - min(value)) / (max(value) - min(value)),
dataset = 'empirical',
.after = value
) %>%
dplyr::select(-c(value))
Combine the empirical and fitted dataframes
price_df <- rbind(price_exp_norm, price_emp_norm)
The density histograms below show that the distribution of the
empirical Price data does not match the distribution of the
fitted exponential data.
require(ggplot2)
ggplot(price_df, aes(x = normalized, fill = dataset, color = dataset)) +
geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
labs(x = 'Normalized Price', y = 'Density') +
facet_wrap(~dataset) +
ggtitle("Histograms of empirical vs fitted exponential data")
5th percentile
qexp(0.05, rate)
## [1] 1.003307
95th percentile
qexp(0.95, rate)
## [1] 58.59709
If the data are normally distributed, then 95% will be within \(\pm 2\) standard deviations of the mean or between $9.96 and $29.16.
price_mean <- mean(retail_df$Price)
price_sd <- sd(retail_df$Price)
lower_bound <- price_mean - 2*price_sd
upper_bound <- price_mean + 2*price_sd
sprintf("95%% of the prices would be between $%.2f and $%.2f", lower_bound, upper_bound)
## [1] "95% of the prices would be between $9.96 and $29.16"
The actual 5th and 95th percentiles of Price in the synthetic retail dataset are $11.41 and $27.03, respectively. These are far from the exponential model's percentiles ($1.00 and $58.60), again showing the poor exponential fit, but reasonably close to the normal-theory bounds above.
quantile(retail_df$Price, c(0.05, 0.95))
## 5% 95%
## 11.41108 27.03023
Answer: As shown by the histograms above, an exponential distribution model does not fit the empirical price data very well. This means that predicting future prices using an exponential model would not be very accurate. Below, I show that a normal distribution would be a more appropriate model.
Fit a normal distribution to the data
require(MASS)
set.seed(128)
price_normal_fit <- fitdistr(retail_df$Price, "normal")
price_normal_fit
## mean sd
## 19.5601883 4.7895520
## ( 0.3386725) ( 0.2394776)
Generate samples from the fitted normal distribution
require(dplyr)
# Generate a sample of fitted data
mean_estimate <- price_normal_fit$estimate[[1]]
sd_estimate <- price_normal_fit$estimate[[2]]
price_normal_fitted <- tibble(value = rnorm(1000, mean_estimate, sd_estimate))
# Normalize the values
price_normal_norm <- price_normal_fitted %>%
mutate(
normalized = (value - min(value)) / (max(value) - min(value)),
dataset = 'fitted'
) %>%
dplyr::select(-c(value))
# Combine dataframes
price_df2 <- rbind(price_normal_norm, price_emp_norm)
The density histograms show that the distribution of the empirical
Price data is a much closer match to the distribution of
the fitted normal data.
require(ggplot2)
ggplot(price_df2, aes(x = normalized, fill = dataset, color = dataset)) +
geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
labs(x = 'Normalized Price', y = 'Density') +
facet_wrap(~dataset) +
ggtitle("Histograms of empirical vs fitted normal data")
Fit a multiple regression model to predict Inventory_Levels based on Sales, Lead_Time_Days, and Price.
My initial thought was to create a multiple linear regression model with Inventory_Levels as the dependent variable and Sales, Lead_Time_Days, and Price as the independent variables. However, none of the coefficient p-values were significant (i.e., all \(p > 0.05\)), apart from the intercept. In addition, \(R^2 = 0.0122\) means that only about 1% of the variation in inventory levels is explained by the model. Finally, the F-statistic \(p = 0.49 > 0.05\) indicates that the model performs no better than a model with no independent variables.
inventory_lm <- lm(Inventory_Levels ~ Sales + Lead_Time_Days + Price, data = retail_df)
summary(inventory_lm)
##
## Call:
## lm(formula = Inventory_Levels ~ Sales + Lead_Time_Days + Price,
## data = retail_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.54 -118.07 -7.68 111.81 372.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 464.792662 61.321852 7.580 1.35e-12 ***
## Sales -0.007809 0.023955 -0.326 0.745
## Lead_Time_Days 7.316793 5.293049 1.382 0.168
## Price -1.087778 2.305846 -0.472 0.638
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 155.3 on 196 degrees of freedom
## Multiple R-squared: 0.01223, Adjusted R-squared: -0.002887
## F-statistic: 0.8091 on 3 and 196 DF, p-value: 0.4902
Residual analysis
# MSE of the residuals
mean(inventory_lm$residuals ^ 2)
## [1] 23626.66
residuals <- residuals(inventory_lm)
n_bins <- 30
hist(residuals, breaks = seq(min(residuals), max(residuals), length.out = n_bins + 1),
main = "Histogram of Residuals", xlab = "Residuals", ylab = "Count")
Calculating the skewness confirms that there is a slight positive skew.
require(moments)
skewness(inventory_lm$residuals)
## [1] 0.1047673
The Q-Q plot suggests that the residuals are not completely normal, with residuals deviating from the diagonal at quantiles less than -1.
plot(inventory_lm, which = 2)
However, the Shapiro-Wilk test gives \(p > 0.05\), so we cannot reject the null hypothesis that the residuals are normally distributed.
shapiro.test(inventory_lm$residuals)
##
## Shapiro-Wilk normality test
##
## data: inventory_lm$residuals
## W = 0.99271, p-value = 0.424
plot(inventory_lm, which = 1)  # residuals vs fitted
plot(inventory_lm, which = 3)  # scale-location
In addition, the Breusch-Pagan test gives \(p > 0.05\), so we cannot reject the null hypothesis of homoscedasticity, which agrees with the scale-location plot.
require(lmtest)
bptest(inventory_lm)
##
## studentized Breusch-Pagan test
##
## data: inventory_lm
## BP = 3.6206, df = 3, p-value = 0.3055
Interpretation: The residual analysis suggests that the model assumptions are satisfied; however, the low \(R^2\) and the F-statistic p-value > 0.05 indicate that the model accounts for very little of the variation in inventory levels (i.e., it is not useful for prediction).
Next, I investigated whether data transformation could improve the model. The residual analysis above (Approach 1) indicated that the residuals are normally distributed and have constant variance, which suggests that the dependent variable does not need to be transformed.
Consistent with this, a Box-Cox power transformation showed that the optimal power of the dependent variable is 0.858 (i.e., close to 1).7,8
require(car)
powerTransform(lm(retail_df$Inventory_Levels ~ 1))
## Estimated transformation parameter
## Y1
## 0.8578132
The boxcox() function in the MASS package shows this
visually:
require(MASS)
boxcox(lm(retail_df$Inventory_Levels ~ 1))
To assess whether the independent variables would benefit from transformation, I tried to perform a Box-Tidwell transformation,9 but I got an error that I couldn’t resolve.
require(car)
boxTidwell(Inventory_Levels ~ Sales + Price + Lead_Time_Days, data = retail_df)
## Error in lm.fit(cbind(1, x1.p, x2), y, ...): NA/NaN/Inf in 'x'
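One common cause of this error is that boxTidwell requires strictly positive predictors (the power transformation takes logs of them internally), so checking the minimum of each predictor is a reasonable first diagnostic (a sketch):
# boxTidwell needs strictly positive predictors; check the minimum of each
sapply(retail_df[c('Sales', 'Price', 'Lead_Time_Days')], min)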
I then tried transforming Inventory_Levels based on the
concepts of lead time and inventory turnover.10 I did this by
defining a parameter that I call “surplus inventory”, which reflects the
inventory level in excess of what is needed to meet demand while waiting
for replenishment of products that have been sold. I also deseasonalized
the Price data (which I use as a proxy for demand) to
remove seasonal effects.11
These modifications resulted in a model with an adjusted \(R^2 = 0.421\), which means that it explains
42.1% of the variation in Surplus_Inventory.
retail_df <- retail_df %>%
mutate(
Price_Deseasonalized = Price / Seasonality_Index,
Quantity_Sold = Sales / Price_Deseasonalized,
Surplus_Inventory = Inventory_Levels - Quantity_Sold * Lead_Time_Days,
.after = Price
)
The coefficients of all terms were statistically significant and the F-statistic \(P<0.05\).
inventory_SP <- lm(Surplus_Inventory ~ Sales + Lead_Time_Days + Price, data = retail_df)
summary(inventory_SP)
##
## Call:
## lm(formula = Surplus_Inventory ~ Sales + Lead_Time_Days + Price,
## data = retail_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -658.96 -121.89 17.11 128.38 471.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 349.94359 72.79568 4.807 3.04e-06 ***
## Sales -0.31565 0.02844 -11.100 < 2e-16 ***
## Lead_Time_Days -23.02149 6.28342 -3.664 0.00032 ***
## Price 13.89741 2.73729 5.077 8.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 184.3 on 196 degrees of freedom
## Multiple R-squared: 0.4298, Adjusted R-squared: 0.4211
## F-statistic: 49.25 on 3 and 196 DF, p-value: < 2.2e-16
The MSE of the residuals is 33295.31
mean(inventory_SP$residuals ^ 2)
## [1] 33295.31
The adjusted \(R^2\) increased to
0.541 by using Quantity_Sold in the regression instead of
Sales and Price.
inventory_QS <- lm(Surplus_Inventory ~ Quantity_Sold + Lead_Time_Days,
data = retail_df)
summary(inventory_QS)
##
## Call:
## lm(formula = Surplus_Inventory ~ Quantity_Sold + Lead_Time_Days,
## data = retail_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.56 -124.09 -13.33 113.65 390.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 652.1385 43.4906 14.995 < 2e-16 ***
## Quantity_Sold -6.6337 0.4415 -15.025 < 2e-16 ***
## Lead_Time_Days -24.3457 5.5850 -4.359 2.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 164.1 on 197 degrees of freedom
## Multiple R-squared: 0.5457, Adjusted R-squared: 0.541
## F-statistic: 118.3 on 2 and 197 DF, p-value: < 2.2e-16
The MSE of the residuals is 26530.25
mean(inventory_QS$residuals ^ 2)
## [1] 26530.25
The MSE of the model derived from Approach 1 was lower (23626.66), but that model predicts a different response (Inventory_Levels rather than Surplus_Inventory), so the two MSEs are not directly comparable; given that its \(R^2\) was close to 0, the inventory_QS model appears to be the better model despite its higher MSE.
The Akaike information criterion (AIC) values are shown below (lower is preferred). Note that AIC is only strictly comparable between models fitted to the same response, so the meaningful comparison here is inventory_SP vs inventory_QS, which again favors inventory_QS.
AIC(inventory_lm, inventory_SP, inventory_QS)
## df AIC
## inventory_lm 5 2591.602
## inventory_SP 5 2660.210
## inventory_QS 4 2612.784
The ‘QS’ regression model is
\[ Surplus\_Inventory = 652.1385 - 6.6337 \times Quantity\_Sold - 24.3457 \times Lead\_Time\_Days \]
or expanding the left hand side and rearranging,
\[ \begin{aligned} Inventory\_Levels &= 652.1385 - 6.6337 \times Quantity\_Sold - 24.3457 \times Lead\_Time\_Days\ + \\ & \qquad Quantity\_Sold \times Lead\_Time\_Days \end{aligned} \]
where
\(Quantity\_Sold = \frac{Sales}{Price\_Deseasonalized} = \frac{Sales \times Seasonality\_Index}{Price}\)
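One way to sanity-check this rearrangement is to reconstruct predicted Inventory_Levels from the QS model and compare them with the observed values (a sketch; pred_inventory is a name introduced here for illustration):
# Undo the surplus transformation by adding back Quantity_Sold * Lead_Time_Days
pred_inventory <- predict(inventory_QS) +
  retail_df$Quantity_Sold * retail_df$Lead_Time_Days
cor(pred_inventory, retail_df$Inventory_Levels)^2 # implied R^2 on the original scale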
Interpretation of coefficients
- 652.1385 is the expected surplus inventory level when all other variables are zero.
- The Quantity_Sold coefficient (-6.6337) is the expected change in surplus inventory for a one-unit increase in Quantity_Sold, holding all other variables constant. It makes sense that inventory decreases as products are sold.
- The Lead_Time_Days coefficient (-24.3457) is the expected change in surplus inventory for a one-day increase in Lead_Time_Days, holding all other variables constant. It makes sense that with longer lead times, inventory continues to be drawn down while waiting for a restock.
Residual analysis
residuals <- residuals(inventory_QS)
n_bins <- 30
hist(residuals, breaks = seq(min(residuals), max(residuals), length.out = n_bins + 1),
main = "Residuals of Inventory_QS Model", xlab = "Residuals", ylab = "Count")
Calculate skewness
skewness(inventory_QS$residuals)
## [1] 0.09605207
The Q-Q plot shows that the residuals fall mostly on the diagonal, although they deviate somewhat at quantiles below -1. Overall, the distribution of the residuals appears to be close to normal.
plot(inventory_QS, which = 2)
The Shapiro-Wilk test gives \(p > 0.05\), so we cannot reject the null hypothesis that the residuals are normally distributed.
shapiro.test(inventory_QS$residuals)
##
## Shapiro-Wilk normality test
##
## data: inventory_QS$residuals
## W = 0.99331, p-value = 0.5016
plot(inventory_QS, which = 1)  # residuals vs fitted
plot(inventory_QS, which = 3)  # scale-location
Consistent with this result, the Breusch-Pagan test gives \(p > 0.05\), so we cannot reject the null hypothesis of homoscedasticity (i.e., the residuals have constant variance).
require(lmtest)
bptest(inventory_QS)
##
## studentized Breusch-Pagan test
##
## data: inventory_QS
## BP = 0.28694, df = 2, p-value = 0.8663
plot(inventory_QS, which = 4)  # Cook's distance
Although the histogram of the residuals shows a little skew, the Q-Q plot and Shapiro-Wilk test suggest that the residuals of the ‘QS’ model are normally distributed. In addition, the scale-location plot and Breusch-Pagan test show that the residuals are homoscedastic. Therefore, I conclude that this model is appropriate for modeling inventory levels in the synthetic retail dataset and accounts for approximately half of the variation in inventory levels.
3. https://www.biostathandbook.com/fishers.html; https://statisticsbyjim.com/hypothesis-testing/fishers-exact-test/↩︎
4. Fieberg J. (2022). Statistics for Ecologists, chapter 6.3, pages 127-128. Available at: https://fw8051statistics4ecologists.netlify.app/multicol.html↩︎
5. Price elasticity of demand. https://en.wikipedia.org/wiki/Price_elasticity_of_demand↩︎
6. Price elasticity: The role of inelasticity in pricing strategies. https://www.fastercapital.com/content/Price-elasticity--The-Role-of-Inelasticity-in-Pricing-Strategies.html↩︎
7. Box Cox transformation in R. https://www.r-bloggers.com/2022/10/box-cox-transformation-in-r/↩︎
8. The Box-Cox transformation requires strictly positive values, which is the case for the variables in the synthetic retail dataset.↩︎
9. Transforming predictors in regression: Box Tidwell. https://www.youtube.com/watch?v=V_Ut4O1qyqM↩︎
10. Lead time: Enhancing supply chain efficiency with inventory turnover. https://fastercapital.com/content/Lead-time--Enhancing-Supply-Chain-Efficiency-with-Inventory-Turnover.html↩︎
11. Seasonality indexes. https://www.inventoryops.com/articles/seasonality-indexes.html↩︎