In Quiz 3, my goal is to extend our team's Mini Project 2 analysis by selecting additional unanswered questions from each section of the provided question list. For each selected question, I provide evidence from the Walmart dataset using one simple visual, summary table, or model output, and then interpret what it means for the CEO.
# ------------------------------------------------------------
# 1.1 Load data from Excel
# ------------------------------------------------------------
# Packages used throughout this report
library(readxl)    # read_excel()
library(tidyverse) # dplyr, tidyr, purrr, tibble, ggplot2
library(lubridate) # dmy(), year(), month(), floor_date()
library(scales)    # comma() axis labels
library(broom)     # tidy() model summaries
walmart_raw <- read_excel("WalmartData.xlsx")
# Quick check of columns & types
glimpse(walmart_raw)
## Rows: 6,435
## Columns: 8
## $ Store <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ Date <chr> "05-02-2010", "12-02-2010", "19-02-2010", "26-02-2010", "…
## $ Weekly_Sales <chr> "1643690.9", "1641957.44", "1611968.17", "1409727.59", "1…
## $ Holiday_Flag <chr> "0", "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
## $ Temperature <chr> "42.31", "38.51", "39.93", "46.63", "46.5", "57.79", "54.…
## $ Fuel_Price <chr> "2.572", "2.548", "2.514", "2.561", "2.625", "2.667", "2.…
## $ CPI <chr> "211.0963582", "211.2421698", "211.2891429", "211.3196429…
## $ Unemployment <chr> "8.106", "8.106", "8.106", "8.106", "8.106", "8.106", "8.…
# ------------------------------------------------------------
# 1.2 Clean and prepare (very important for correct analysis)
# ------------------------------------------------------------
walmart <- walmart_raw %>%
  mutate(
    # Dates arrive as day-month-year text (e.g., "05-02-2010"),
    # so parse them explicitly; as.Date() alone would fail here.
    Date = dmy(Date),
    # Convert key numeric variables to numeric (avoids "character" errors in sum/lm)
    Weekly_Sales = as.numeric(Weekly_Sales),
    Temperature = as.numeric(Temperature),
    Fuel_Price = as.numeric(Fuel_Price),
    CPI = as.numeric(CPI),
    Unemployment = as.numeric(Unemployment),
    # Holiday flag should be 0/1 numeric
    Holiday_Flag = as.integer(Holiday_Flag),
    # Create time variables for later analysis
    Year = year(Date),
    Month = month(Date, label = TRUE, abbr = TRUE)
  )
# Missing values check (simple and beginner-friendly)
missing_summary <- walmart %>%
  summarise(across(everything(), ~ sum(is.na(.))))
missing_summary

Interpretation (Data Prep):
This step ensures that numeric operations and regression models work
correctly (e.g., sum() and lm() require numeric variables). Creating
Year and Month supports trend/seasonality controls in later steps.
Without these checks, results may reflect data type issues rather than
business reality.
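As a quick guard against silent parsing failures, the sketch below (my own addition; the stopifnot() conditions are not part of the original workflow) confirms the conversions worked before any modeling:

# Sanity checks: stop early if cleaning silently produced NAs or wrong types
stopifnot(
  inherits(walmart$Date, "Date"),   # dates parsed, not left as text
  is.numeric(walmart$Weekly_Sales), # sales usable in sum()/lm()
  !any(is.na(walmart$Date))         # no dates lost during parsing
)
range(walmart$Date)  # first weeks should be early Feb 2010, per the glimpse above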
# ------------------------------------------------------------
# Aggregate weekly macro variables over time (monthly average)
# ------------------------------------------------------------
macro_monthly <- walmart %>%
  mutate(MonthStart = floor_date(Date, unit = "month")) %>%
  group_by(MonthStart) %>%
  summarise(
    CPI_avg = mean(CPI, na.rm = TRUE),
    Unemp_avg = mean(Unemployment, na.rm = TRUE),
    Temp_avg = mean(Temperature, na.rm = TRUE),
    .groups = "drop"
  )
# Reshape to long format for easier plotting
macro_long <- macro_monthly %>%
  pivot_longer(cols = c(CPI_avg, Unemp_avg, Temp_avg),
               names_to = "MacroVar",
               values_to = "Value")
ggplot(macro_long, aes(x = MonthStart, y = Value)) +
  geom_line(alpha = 0.8) +
  facet_wrap(~ MacroVar, scales = "free_y", ncol = 1) +
  labs(
    title = "Macro Variables Over Time (Monthly Averages)",
    x = "Month",
    y = "Value"
  )

Interpretation (Q9):
Macro variables change over time and can shift the baseline conditions
for sales. This matters because a CEO might incorrectly attribute sales
changes to holidays or store performance when some variation is actually
driven by inflation (CPI) or economic pressure (unemployment). This also
supports why macro controls are important in regression.
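To make that point concrete, the short sketch below (my own illustration, not one of the MP2 models) compares the estimated holiday coefficient with and without macro controls; a noticeable shift in the estimate would confirm that macro conditions confound a raw holiday comparison:

# Holiday effect estimate without vs with macro controls
m_raw  <- lm(Weekly_Sales ~ Holiday_Flag, data = walmart)
m_ctrl <- lm(Weekly_Sales ~ Holiday_Flag + CPI + Unemployment, data = walmart)
tibble(
  model = c("No macro controls", "With CPI + Unemployment"),
  holiday_estimate = c(coef(m_raw)["Holiday_Flag"], coef(m_ctrl)["Holiday_Flag"])
)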
# ------------------------------------------------------------
# Simple regression: Weekly_Sales explained only by Unemployment
# ------------------------------------------------------------
m_simple_unemp <- lm(Weekly_Sales ~ Unemployment, data = walmart)
summary(m_simple_unemp)
##
## Call:
## lm(formula = Weekly_Sales ~ Unemployment, data = walmart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -844415 -481049 -69658 369648 2794876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1302485 30645 42.503 <2e-16 ***
## Unemployment -31944 3730 -8.564 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 561200 on 6433 degrees of freedom
## Multiple R-squared: 0.01127, Adjusted R-squared: 0.01112
## F-statistic: 73.35 on 1 and 6433 DF, p-value: < 2.2e-16
# ------------------------------------------------------------
# Visualize the relationship (scatter + fitted line)
# ------------------------------------------------------------
ggplot(walmart, aes(x = Unemployment, y = Weekly_Sales)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "lm", se = TRUE) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Simple Relationship: Weekly Sales vs Unemployment",
    x = "Unemployment Rate",
    y = "Weekly Sales"
  )

Interpretation (Q13):
This model checks whether unemployment alone is associated with weekly
sales. Even if the coefficient is statistically significant, the key
managerial question is whether the relationship is strong enough to
matter: the slope is clearly negative, but the R-squared is only about
0.011, so unemployment alone explains very little of the week-to-week
variation. This also sets up why we later need multiple regression
controls.
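One way to translate the slope into business terms (a small sketch of my own, reusing the model fitted above) is to express the effect of a one-point rise in unemployment as a share of average weekly sales:

# Effect of a 1-point unemployment increase, as % of mean weekly sales
slope <- coef(m_simple_unemp)["Unemployment"]  # about -31,944 per the summary
slope / mean(walmart$Weekly_Sales, na.rm = TRUE) * 100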
# ------------------------------------------------------------
# Multicollinearity means predictors move together (correlation).
# High correlation can make coefficients unstable.
# ------------------------------------------------------------
predictors <- walmart %>%
  select(Temperature, Fuel_Price, CPI, Unemployment) %>%
  drop_na()
cor_matrix <- cor(predictors)
cor_matrix
##              Temperature  Fuel_Price        CPI Unemployment
## Temperature 1.0000000 0.14498181 0.1768877 0.10115786
## Fuel_Price 0.1449818 1.00000000 -0.1706418 -0.03468374
## CPI 0.1768877 -0.17064180 1.0000000 -0.30202006
## Unemployment 0.1011579 -0.03468374 -0.3020201 1.00000000
# ------------------------------------------------------------
# Simple correlation heatmap (beginner-friendly)
# ------------------------------------------------------------
cor_df <- as.data.frame(as.table(cor_matrix)) %>%
  rename(Correlation = Freq)  # Var1/Var2 already have usable names
ggplot(cor_df, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  geom_text(aes(label = round(Correlation, 2)), size = 3) +
  labs(
    title = "Correlation Among Macro Predictors (Multicollinearity Check)",
    x = "",
    y = ""
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation (Q30):
If CPI and unemployment (or fuel price) are strongly correlated,
coefficient estimates in a multiple regression can become unstable or
change signs. Here the strongest pairwise correlation is CPI vs
unemployment at about -0.30, which is modest, so multicollinearity is
unlikely to be severe. Still, for managerial use we should interpret
“which variable matters most” with caution when predictors overlap in
information.
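A standard complementary check is the variance inflation factor (VIF). Below is a minimal sketch, assuming the car package is installed (it is not used elsewhere in this report); values well below 5 would reinforce that multicollinearity is mild here:

# VIF: how much each coefficient's variance is inflated by the other predictors
library(car)
m_macro <- lm(Weekly_Sales ~ Temperature + Fuel_Price + CPI + Unemployment,
              data = walmart)
vif(m_macro)  # rule of thumb: values above ~5-10 signal problematic overlap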
# ------------------------------------------------------------
# Create a "pre-holiday week" indicator:
# PreHoliday = 1 if next week is a holiday
# ------------------------------------------------------------
walmart_event <- walmart %>%
  arrange(Store, Date) %>%
  group_by(Store) %>%
  mutate(
    NextWeekHoliday = lead(Holiday_Flag, 1),
    PreHoliday = ifelse(NextWeekHoliday == 1 & Holiday_Flag == 0, 1, 0)
  ) %>%
  ungroup()
# Compare average sales: normal vs pre-holiday vs holiday
event_summary <- walmart_event %>%
  mutate(
    EventType = case_when(
      Holiday_Flag == 1 ~ "Holiday Week",
      PreHoliday == 1 ~ "Pre-Holiday Week",
      TRUE ~ "Regular Week"
    )
  ) %>%
  group_by(EventType) %>%
  summarise(
    avg_sales = mean(Weekly_Sales, na.rm = TRUE),
    .groups = "drop"
  )
event_summary

ggplot(event_summary, aes(x = EventType, y = avg_sales)) +
  geom_col(alpha = 0.85) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Do Sales Increase Before Holidays?",
    x = "",
    y = "Average Weekly Sales"
  )

Interpretation (Q35):
If pre-holiday weeks show higher average sales than regular weeks, it
suggests customers may start shopping earlier (anticipation effect).
This insight complements MP2 by adding timing nuance (not only holiday
week vs non-holiday week).
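To check whether any pre-holiday bump is more than noise, a simple follow-up (my own sketch, not part of the original comparison) is a two-sample t-test of pre-holiday vs regular non-holiday weeks:

# Is the pre-holiday vs regular-week difference statistically meaningful?
non_holiday_weeks <- walmart_event %>%
  filter(Holiday_Flag == 0, !is.na(PreHoliday))
t.test(Weekly_Sales ~ PreHoliday, data = non_holiday_weeks)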
# ------------------------------------------------------------
# Interaction model: Holiday effect depends on unemployment
# Weekly_Sales ~ Holiday_Flag * Unemployment, plus Year/Month time controls
# ------------------------------------------------------------
m_interaction <- lm(
  Weekly_Sales ~ Holiday_Flag * Unemployment + Year + Month,
  data = walmart
)
summary(m_interaction)
##
## Call:
## lm(formula = Weekly_Sales ~ Holiday_Flag * Unemployment + Year +
## Month, data = walmart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1099850 -463296 -64648 371480 2571810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1333159.9 34011.8 39.197 < 2e-16 ***
## Holiday_Flag 40215.4 120718.9 0.333 0.739045
## Unemployment -33319.0 3838.2 -8.681 < 2e-16 ***
## Year -1158.9 796.0 -1.456 0.145447
## Month.L 183599.0 26487.4 6.932 4.57e-12 ***
## Month.Q 93819.1 26426.6 3.550 0.000388 ***
## Month.C 184981.2 25142.9 7.357 2.11e-13 ***
## Month^4 52091.9 25251.4 2.063 0.039160 *
## Month^5 42413.2 25513.0 1.662 0.096478 .
## Month^6 -73365.7 25047.8 -2.929 0.003412 **
## Month^7 -15437.2 24259.0 -0.636 0.524573
## Month^8 -31571.0 23982.7 -1.316 0.188085
## Month^9 1600.1 23101.4 0.069 0.944782
## Month^10 6872.5 23350.0 0.294 0.768520
## Month^11 29112.3 22903.4 1.271 0.203742
## Holiday_Flag:Unemployment -745.7 14485.6 -0.051 0.958944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 556000 on 6419 degrees of freedom
## Multiple R-squared: 0.03155, Adjusted R-squared: 0.02929
## F-statistic: 13.94 on 15 and 6419 DF, p-value: < 2.2e-16
Interpretation (Q44):
The interaction term tests whether holiday lift is smaller when
unemployment is higher. A negative, significant interaction would
suggest that economic stress reduces holiday-driven spending, which is a
CEO-relevant risk insight. Here, however, the interaction estimate is
near zero (-745.7, p = 0.96), so this sample provides no evidence that
the holiday effect varies with unemployment.
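For illustration, the fitted coefficients can be combined to show the implied holiday lift at different unemployment rates (a sketch using the model above; the 5% and 10% rates are example values I chose):

# Implied holiday lift = Holiday_Flag coef + interaction coef * unemployment
b <- coef(m_interaction)
unemp_levels <- c(5, 10)  # example unemployment rates (%)
b["Holiday_Flag"] + b["Holiday_Flag:Unemployment"] * unemp_levels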
# ------------------------------------------------------------
# Store-only model: Weekly_Sales explained by Store factor
# This tells us how much variation is "store identity" (baseline differences).
# ------------------------------------------------------------
m_store_only <- lm(Weekly_Sales ~ factor(Store), data = walmart)
summary(m_store_only)$r.squared
## [1] 0.9174154
Interpretation (Q53):
The R-squared of about 0.92 means that store identity alone explains
roughly 92% of the variation in weekly sales (store baselines differ a
lot). This strengthens the argument for store-specific strategies and
supports why segmentation can be valuable.
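For contrast, a quick sketch (my own comparison, not in the original analysis) fits a macro-only model; a much lower R-squared would underline how dominant store identity is relative to macro conditions:

# How much do macro variables explain, compared with store identity (~0.92)?
m_macro_only <- lm(Weekly_Sales ~ Temperature + Fuel_Price + CPI + Unemployment,
                   data = walmart)
summary(m_macro_only)$r.squared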
# ------------------------------------------------------------
# Build store-level features (simple version)
# ------------------------------------------------------------
store_features <- walmart %>%
  group_by(Store) %>%
  summarise(
    avg_sales = mean(Weekly_Sales, na.rm = TRUE),
    cv_sales = sd(Weekly_Sales, na.rm = TRUE) / mean(Weekly_Sales, na.rm = TRUE),
    holiday_mean = mean(Weekly_Sales[Holiday_Flag == 1], na.rm = TRUE),
    nonholiday_mean = mean(Weekly_Sales[Holiday_Flag == 0], na.rm = TRUE),
    holiday_lift_pct = (holiday_mean - nonholiday_mean) / nonholiday_mean * 100,
    .groups = "drop"
  ) %>%
  mutate(holiday_lift_pct = ifelse(is.finite(holiday_lift_pct), holiday_lift_pct, NA_real_)) %>%
  drop_na()
# Scale for k-means
X <- store_features %>%
  select(avg_sales, cv_sales, holiday_lift_pct) %>%
  scale()
set.seed(123)
# Run k-means for different k values and compare cluster sizes (stability proxy)
cluster_sizes <- map_dfr(2:6, function(k){
  km <- kmeans(X, centers = k, nstart = 25)
  tibble(k = k, cluster = factor(km$cluster)) %>%
    count(k, cluster, name = "n_stores")
})
cluster_sizes

ggplot(cluster_sizes, aes(x = cluster, y = n_stores)) +
  geom_col(alpha = 0.85) +
  facet_wrap(~ k, scales = "free_x") +
  labs(
    title = "Cluster Size Patterns Across Different k (Stability Check)",
    x = "Cluster label",
    y = "Number of stores"
  )

Interpretation (Q67):
If cluster sizes become extreme (e.g., one tiny cluster) as k changes,
segmentation may be less stable. If patterns remain reasonable across k
values, that suggests the segmentation structure is more robust.
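A complementary robustness check (my own sketch) is the classic elbow plot of total within-cluster sum of squares across k; a clear bend would support the chosen k:

# Elbow check: total within-cluster sum of squares for each k
set.seed(123)
elbow <- tibble(
  k = 2:6,
  tot_withinss = map_dbl(2:6, ~ kmeans(X, centers = .x, nstart = 25)$tot.withinss)
)
ggplot(elbow, aes(x = k, y = tot_withinss)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Check for k-means", x = "k", y = "Total within-cluster SS")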
# ------------------------------------------------------------
# Choose a k (use k=3 for a simple, interpretable structure)
# ------------------------------------------------------------
set.seed(123)
k <- 3
km3 <- kmeans(X, centers = k, nstart = 25)
store_clustered <- store_features %>%
  mutate(Segment = factor(km3$cluster)) %>%
  select(Store, Segment)
# Merge segment labels back to the full weekly dataset
walmart_seg <- walmart %>%
  inner_join(store_clustered, by = "Store")
# Fit the same model by segment: Weekly_Sales ~ Holiday_Flag + controls
seg_models <- walmart_seg %>%
  group_by(Segment) %>%
  do(
    tidy(lm(Weekly_Sales ~ Holiday_Flag + Year + Month, data = .))
  ) %>%
  ungroup()
# Keep only the Holiday_Flag coefficient for comparison
holiday_by_segment <- seg_models %>%
  filter(term == "Holiday_Flag") %>%
  select(Segment, estimate, std.error, p.value)
holiday_by_segment

ggplot(holiday_by_segment, aes(x = Segment, y = estimate)) +
  geom_col(alpha = 0.85) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Estimated Holiday Effect by Store Segment (with time controls)",
    x = "Segment",
    y = "Holiday coefficient (sales units)"
  )

Interpretation (Q72):
If holiday coefficients differ across segments, it means holiday
promotions are not equally effective everywhere. This directly supports
differentiated strategy: focus holiday operational complexity where the
segment response is strongest.
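Because each segment coefficient comes with a standard error, a small extension of the plot (my own sketch) adds approximate 95% confidence intervals, helping the CEO judge whether segment differences are reliable rather than noise:

# Same comparison with approximate 95% confidence intervals
ggplot(holiday_by_segment, aes(x = Segment, y = estimate)) +
  geom_col(alpha = 0.85) +
  geom_errorbar(aes(ymin = estimate - 1.96 * std.error,
                    ymax = estimate + 1.96 * std.error),
                width = 0.2) +
  scale_y_continuous(labels = comma) +
  labs(title = "Holiday Effect by Segment (with 95% CIs)",
       x = "Segment", y = "Holiday coefficient (sales units)")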
# ------------------------------------------------------------
# Simple train/test split (beginner-friendly)
# We predict Weekly_Sales and compare RMSE with vs without Holiday_Flag
# ------------------------------------------------------------
set.seed(123)
n <- nrow(walmart)
test_idx <- sample(1:n, size = round(0.2 * n)) # 20% test
train <- walmart[-test_idx, ]
test <- walmart[test_idx, ]
# Model A: without holiday
m_no_holiday <- lm(Weekly_Sales ~ Year + Month + CPI + Unemployment + Fuel_Price + Temperature, data = train)
# Model B: with holiday
m_with_holiday <- lm(Weekly_Sales ~ Holiday_Flag + Year + Month + CPI + Unemployment + Fuel_Price + Temperature, data = train)
# Predictions
pred_A <- predict(m_no_holiday, newdata = test)
pred_B <- predict(m_with_holiday, newdata = test)
# RMSE function
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2, na.rm = TRUE))
rmse_A <- rmse(test$Weekly_Sales, pred_A)
rmse_B <- rmse(test$Weekly_Sales, pred_B)
tibble(
  Model = c("No Holiday Variable", "With Holiday Variable"),
  RMSE = c(rmse_A, rmse_B)
)

Interpretation (Q82):
If adding the holiday variable reduces RMSE, it improves predictive
accuracy and can support inventory planning. If the improvement is very
small, it suggests that time and macro factors already capture much of
the predictable structure, and holiday flags add limited incremental
predictive value. One caveat: the random split mixes weeks from all
periods, so a time-based holdout (train on earlier weeks, test on later
ones) would be a stricter test of forecasting value.
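To report the comparison in CEO-friendly terms, the RMSE gap can be expressed as a percentage improvement (a small sketch using the quantities computed above):

# Relative improvement from adding the holiday flag, in percent
(rmse_A - rmse_B) / rmse_A * 100  # positive = holiday flag helps out-of-sample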
Evidence to reference from this report:
- Store identity explains a meaningful share of variation (fixed-effects
insight).
- Segments differ in volatility and holiday responsiveness.
- Holiday coefficients differ across segments.
Interpretation (Q91):
Analytics suggests that a single national strategy is risky because
stores differ significantly in baseline sales, volatility, and holiday
sensitivity. A better approach is a hybrid model: maintain a national
framework (e.g., core holiday calendar) but apply local differentiation
using store segments. For example, holiday-heavy operational plans
should focus on segments where holiday effects are consistently strong,
while stable low-impact segments should focus on cost efficiency and
steady inventory.
This individual Quiz 3 analysis complements Mini Project 2 by:

- adding macro trend context (Q9),
- showing single-variable risk (Q13),
- checking multicollinearity (Q30),
- adding timing nuance with pre-holiday effects (Q35),
- testing an interaction mechanism (Q44),
- quantifying store identity importance (Q53),
- checking segmentation robustness (Q67),
- comparing holiday effectiveness across segments (Q72),
- and evaluating predictive value of holiday variables (Q82),

which together support the CEO’s need for clearer, more targeted decisions.