Project Background and Objective

Walmart’s senior management suspects that overall sales performance is driven by a limited number of stores, while sales volatility may originate from a different subset of locations. At the same time, the CEO is uncertain whether observed weekly fluctuations can be systematically explained by holiday effects and macroeconomic conditions, or whether they reflect random noise.

The objective of this project is to translate these managerial concerns into data-driven insights using exploratory analysis, regression modeling, and clustering techniques. The ultimate goal is to provide actionable recommendations rather than purely statistical results.

1 Research Questions

The dataset and project brief provide a wide range of potential analytical questions related to sales performance, volatility, holidays, and macroeconomic conditions. To maintain analytical focus and managerial relevance, this study deliberately narrows the scope to a small set of interconnected research questions. These questions are selected based on (i) the CEO’s stated concerns, (ii) feasibility given the available data, and (iii) their ability to support actionable insights rather than purely descriptive findings.

The final research questions guiding this analysis are as follows:

RQ1: Sales Concentration and Volatility
Which stores contribute disproportionately to Walmart’s total sales, and which stores contribute disproportionately to sales volatility?

RQ2: Holiday Effects Beyond a Binary Indicator
How do holiday periods affect weekly sales levels and volatility, and to what extent does a simple holiday flag capture the true heterogeneity of holiday-driven demand?

RQ3: Drivers of Weekly Sales Performance
To what extent can weekly sales variation be explained by holidays, time structure (trend and seasonality), and macroeconomic conditions?

RQ4: Store Segmentation for Managerial Action
Can Walmart stores be segmented into meaningful groups based on sales level, volatility, and holiday sensitivity, and how can these segments inform differentiated managerial strategies?

Together, these research questions structure the analysis from descriptive diagnosis (RQ1), to causal explanation (RQ2–RQ3), and finally to prescriptive insight (RQ4). This progression ensures that the analysis remains both analytically rigorous and managerially relevant.

2 STEP 1 — Data Import & Data Understanding

2.1 Data Import

This step ensures that the dataset is successfully loaded and that the column names and first observations match expectations.

# ------------------------------------------------
# STEP 1.1 – Import Walmart Dataset
# ------------------------------------------------

# The Excel file should be uploaded to Posit Cloud
# and located in the same folder as this RMD file

walmart_raw <- read_excel("WalmartData.xlsx", sheet = "in")

# First look at the data
head(walmart_raw)

2.2 Data Structure & Variable Types

walmart <- walmart_raw %>%
  mutate(
    # Convert these columns explicitly to numeric
    Weekly_Sales = as.numeric(gsub(",", ".", Weekly_Sales)),
    Temperature  = as.numeric(gsub(",", ".", Temperature)),
    Fuel_Price   = as.numeric(gsub(",", ".", Fuel_Price)),
    CPI          = as.numeric(gsub(",", ".", CPI)),
    Unemployment = as.numeric(gsub(",", ".", Unemployment)),
    Holiday_Flag = as.integer(Holiday_Flag),
    Store        = as.integer(Store)
  )

glimpse(walmart)
## Rows: 6,435
## Columns: 8
## $ Store        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Date         <chr> "05-02-2010", "12-02-2010", "19-02-2010", "26-02-2010", "…
## $ Weekly_Sales <dbl> 1643691, 1641957, 1611968, 1409728, 1554807, 1439542, 147…
## $ Holiday_Flag <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Temperature  <dbl> 42.31, 38.51, 39.93, 46.63, 46.50, 57.79, 54.58, 51.45, 6…
## $ Fuel_Price   <dbl> 2.572, 2.548, 2.514, 2.561, 2.625, 2.667, 2.720, 2.732, 2…
## $ CPI          <dbl> 211.0964, 211.2422, 211.2891, 211.3196, 211.3501, 211.380…
## $ Unemployment <dbl> 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 7…

2.3 Descriptive Statistics

summary(walmart_raw)
##     Store               Date           Weekly_Sales       Holiday_Flag      
##  Length:6435        Length:6435        Length:6435        Length:6435       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  Temperature         Fuel_Price            CPI            Unemployment      
##  Length:6435        Length:6435        Length:6435        Length:6435       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

As a result of the initial data audit, the dataset requires preprocessing to ensure correct variable types. This step is necessary to produce valid descriptive insights and reliable statistical models.

2.4 Missing Values Check

The missing values analysis indicates that there are no missing observations across all variables in the dataset. Each column, including sales, holiday indicators, and macroeconomic variables, contains complete information for all time periods and stores. This is a strong advantage for the analysis, as it eliminates the need for imputation or row removal strategies that could otherwise introduce bias or reduce statistical power. Consequently, subsequent exploratory and modeling steps can be conducted on the full dataset without concerns related to data incompleteness.

# ------------------------------------------------
# STEP 1.4 – Missing values
# ------------------------------------------------

missing_summary <- walmart %>%
  summarise(across(everything(), ~ sum(is.na(.))))

missing_summary

2.5 Date Conversion & Feature Engineering

Time effects (trend, seasonality, holidays) cannot be analyzed unless dates are correctly formatted.

# ------------------------------------------------
# STEP 1.5 – Date handling and time variables
# ------------------------------------------------

walmart <- walmart %>%
  mutate(
    Date  = dmy(Date),              # convert text to Date
    Year  = year(Date),
    Month = month(Date, label = TRUE)
  )

summary(walmart$Date)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2010-02-05" "2010-10-08" "2011-06-17" "2011-06-17" "2012-02-24" "2012-10-26"

After the data type conversion, all key variables are correctly formatted. Weekly sales and macroeconomic variables are stored as numeric values, the store identifier is an integer, and the date variable has been successfully parsed. This confirms that the dataset is now suitable for aggregation, visualization, and statistical modeling.

3 STEP 2 — Exploratory Analysis: Store Concentration & Volatility

Purpose of Step 2 To empirically test the CEO’s intuition that:

  • a small number of stores drive total sales

  • a different subset drives sales volatility

This step is descriptive but strategic — it defines where to look next.

3.1 Store-Level Sales and Volatility Metrics

# ------------------------------------------------
# STEP 2.1 – Store-level performance metrics
# ------------------------------------------------

# ------------------------------------------------
# Fix: Convert numeric-looking text columns to numeric
# ------------------------------------------------




store_summary <- walmart %>%
  group_by(Store) %>%
  summarise(
    avg_weekly_sales = mean(Weekly_Sales, na.rm = TRUE),
    total_sales      = sum(Weekly_Sales, na.rm = TRUE),
    sd_sales         = sd(Weekly_Sales, na.rm = TRUE),
    cv_sales         = sd_sales / avg_weekly_sales,  # relative volatility
    .groups = "drop"
  ) %>%
  mutate(
    sales_share = total_sales / sum(total_sales),
    volatility_share = sd_sales / sum(sd_sales)
  )

# Inspect top stores
store_summary %>% arrange(desc(total_sales)) %>% head(10)

The store-level summary statistics reveal substantial heterogeneity in performance across Walmart locations. The top-selling stores generate average weekly sales exceeding 2 million, while their coefficients of variation range between approximately 0.10 and 0.16. This indicates that even among the highest-revenue stores, relative sales stability differs meaningfully. Some large stores combine high sales with relatively stable demand, whereas others experience considerable week-to-week fluctuations despite similar revenue levels.

3.2 Sales Concentration: Do a Few Stores Dominate Revenue?

# ------------------------------------------------
# STEP 2.2 – Sales concentration plot
# ------------------------------------------------

top_sales <- store_summary %>%
  arrange(desc(total_sales)) %>%
  mutate(rank = row_number())

ggplot(top_sales, aes(x = rank, y = sales_share)) +
  geom_col() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Store Ranking by Share of Total Sales",
    x = "Store Rank (1 = Highest Total Sales)",
    y = "Share of Total Sales"
  )

The sales concentration analysis shows a pronounced skew in revenue contribution across stores. The highest-ranked stores each account for roughly 4–4.5% of total sales, while the sales share declines steadily across lower-ranked locations. This pattern confirms that Walmart’s overall revenue is not evenly distributed, but instead depends heavily on a relatively small group of high-performing stores. Consequently, changes in performance at these top locations are likely to have a disproportionate impact on total company sales.

3.3 Volatility Concentration: Who Drives Weekly Instability?

# ------------------------------------------------
# STEP 2.3 – Volatility concentration plot
# ------------------------------------------------

top_volatility <- store_summary %>%
  arrange(desc(sd_sales)) %>%
  mutate(rank = row_number())

ggplot(top_volatility, aes(x = rank, y = volatility_share)) +
  geom_col() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Store Ranking by Share of Sales Volatility",
    x = "Store Rank (1 = Most Volatile)",
    y = "Share of Total Volatility"
  )

A similar but distinct concentration pattern emerges when examining sales volatility. A limited number of stores contribute a disproportionately large share of total sales volatility, with the most volatile stores accounting for nearly 5% each. Importantly, volatility declines gradually across the ranking, indicating that instability is concentrated rather than widespread. This suggests that Walmart’s week-to-week sales fluctuations are primarily driven by specific locations rather than by uniform, chain-wide shocks.

3.4 Are High-Sales Stores Also High-Volatility Stores?

# ------------------------------------------------
# STEP 2.4 – Sales rank vs volatility rank
# ------------------------------------------------

store_ranks <- store_summary %>%
  mutate(
    sales_rank = rank(-total_sales, ties.method = "first"),
    vol_rank   = rank(-sd_sales, ties.method = "first")
  )

ggplot(store_ranks, aes(x = sales_rank, y = vol_rank)) +
  geom_point() +
  labs(
    title = "Relationship Between Sales Rank and Volatility Rank",
    x = "Sales Rank (1 = Highest Sales)",
    y = "Volatility Rank (1 = Highest Volatility)"
  )

The comparison between sales rank and volatility rank indicates that high-revenue stores are not systematically the most volatile. While some high-sales locations also exhibit high volatility, many top-selling stores display moderate or low volatility rankings. Conversely, several lower-ranked sales stores appear among the most volatile. This dispersion suggests the presence of distinct store roles within the network, separating revenue-driving locations from volatility-driving ones.

3.5 Step 2 Summary

Overall, the exploratory analysis confirms that Walmart’s sales and volatility are both highly concentrated but not perfectly aligned across stores. A small number of locations drive a large share of total revenue, while a partially different subset drives sales instability. These findings validate the CEO’s initial intuition and demonstrate that aggregate metrics mask important store-level dynamics. As a result, subsequent analyses should focus on identifying the drivers of volatility and sales at a more granular level, particularly examining the role of holidays and temporal effects.

4 STEP 3 — Holiday Analysis

In Step 2, we showed that sales and volatility are concentrated in specific stores. However, store-level patterns alone do not explain why weekly sales fluctuate. One frequently cited explanation is the effect of holidays. While the dataset provides a binary holiday indicator (Holiday_Flag), relying solely on a 0/1 variable may oversimplify the underlying dynamics. Therefore, this step examines holiday effects in a structured but focused manner, without losing sight of the broader analytical objective.

4.1 Holiday vs Non-Holiday: Average Sales Comparison

What we do

We first compare average weekly sales during holiday weeks versus non-holiday weeks. This provides an intuitive benchmark for understanding whether holidays matter at all.

# ------------------------------------------------
# STEP 3.1 – Average sales: Holiday vs Non-Holiday
# ------------------------------------------------

holiday_sales_summary <- walmart %>%
  group_by(Holiday_Flag) %>%
  summarise(
    avg_weekly_sales = mean(Weekly_Sales, na.rm = TRUE),
    sd_weekly_sales  = sd(Weekly_Sales, na.rm = TRUE),
    n_weeks = n(),
    .groups = "drop"
  )

holiday_sales_summary

The comparison of average weekly sales indicates that holiday weeks are associated with a different sales level than non-holiday weeks. The difference in mean sales suggests that holidays are not neutral periods and may systematically shift consumer purchasing behavior. However, this aggregate comparison does not yet indicate whether the observed difference is large enough to be economically meaningful or consistent across stores. “Holiday effect var gibi görünüyor” ama tek başına yeterli değil.

4.2 Visual Comparison: Sales Distribution

Why visualization matters

Averages can hide variability. Therefore, we compare the distribution of weekly sales during holiday and non-holiday weeks.

# ------------------------------------------------
# STEP 3.2 – Sales distribution by holiday status
# ------------------------------------------------

ggplot(walmart, aes(x = factor(Holiday_Flag), y = Weekly_Sales)) +
  geom_boxplot() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Weekly Sales Distribution: Holiday vs Non-Holiday Weeks",
    x = "Holiday Flag (0 = Non-Holiday, 1 = Holiday)",
    y = "Weekly Sales"
  )

The boxplot comparison shows that holiday weeks tend to shift the sales distribution upward relative to non-holiday weeks. At the same time, the wider spread during holiday periods indicates increased dispersion, suggesting that holidays may amplify not only sales levels but also sales uncertainty. This observation reinforces the need to examine volatility effects in addition to average sales differences.

4.3 Holiday Lift: How Large Is the Effect?

What is “holiday lift”?

Holiday lift measures the percentage difference in average sales between holiday and non-holiday weeks.

# ------------------------------------------------
# STEP 3.3 – Holiday lift calculation
# ------------------------------------------------

holiday_lift <- holiday_sales_summary %>%
  summarise(
    holiday_lift_pct =
      (avg_weekly_sales[Holiday_Flag == 1] -
       avg_weekly_sales[Holiday_Flag == 0]) /
       avg_weekly_sales[Holiday_Flag == 0] * 100
  )

holiday_lift

The holiday lift calculation quantifies the magnitude of the holiday effect in percentage terms. This metric provides a more interpretable measure for managerial decision-making, as it translates raw sales differences into relative performance gains. A modest lift suggests that holidays play a secondary role, whereas a large lift would indicate that holiday-driven demand is a major contributor to overall sales performance. Holiday’i sadece “var/yok” demiyoruz → etki büyüklüğünü ölçüyoruz.

4.4 Holiday and Volatility

Why this matters

The CEO’s concern is not only about sales levels, but also about instability. We therefore compare volatility across holiday and non-holiday weeks.

# ------------------------------------------------
# STEP 3.4 – Volatility comparison
# ------------------------------------------------

holiday_volatility <- walmart %>%
  group_by(Holiday_Flag) %>%
  summarise(
    sd_weekly_sales = sd(Weekly_Sales, na.rm = TRUE),
    .groups = "drop"
  )

holiday_volatility

The volatility comparison reveals that holiday weeks are associated with a different level of sales variability compared to non-holiday periods. This suggests that holidays may contribute not only to higher average sales, but also to increased uncertainty in weekly performance. Such a pattern is consistent with demand spikes that are unevenly distributed across stores and product categories.

4.5 Zoom-in: Major Holiday Weeks (Black Friday–style Insight)

While the binary holiday flag captures whether a week is classified as a holiday, it does not distinguish between routine holidays and peak demand events such as major shopping periods. To better understand this limitation, we examine weekly sales patterns around the most prominent holiday periods.

ggplot(walmart, aes(x = Date, y = Weekly_Sales)) +
  geom_line(alpha = 0.3) +
  geom_point(
    data = walmart %>% filter(Holiday_Flag == 1),
    aes(x = Date, y = Weekly_Sales),
    color = "red",
    alpha = 0.6
  ) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Weekly Sales Over Time with Holiday Weeks Highlighted",
    x = "Date",
    y = "Weekly Sales"
  )

The time-series visualization reveals that holiday weeks are associated with noticeable sales spikes, but these spikes are not uniform across all holiday periods. While some holiday weeks correspond to exceptionally high sales levels, others remain close to the baseline observed during regular weeks. This heterogeneity indicates that the binary holiday indicator aggregates events with substantially different demand intensities. As a result, treating all holiday weeks as economically equivalent may obscure important variation in sales responses.

ggplot(
  walmart %>% filter(Holiday_Flag == 1),
  aes(x = Weekly_Sales)
) +
  geom_histogram(bins = 30, fill = "grey", color = "white") +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Distribution of Weekly Sales During Holiday Weeks",
    x = "Weekly Sales",
    y = "Frequency"
  )

The distribution of weekly sales during holiday weeks is highly dispersed and right-skewed, with a long upper tail extending toward very high sales values. This pattern suggests that a subset of holiday periods generates disproportionately large sales outcomes, while many holiday weeks exhibit more moderate performance. Such dispersion reinforces the limitation of a simple holiday flag and highlights the need for models that can distinguish between routine holidays and peak demand events.

4.6 Step 3 Summary: What Holidays Explain — and What They Don’t

The extended holiday analysis demonstrates that holiday periods are generally associated with higher average sales and increased variability in weekly performance. However, the time-series and distributional evidence shows that holiday effects are highly heterogeneous: only a subset of holiday weeks generates extreme sales spikes, while others resemble non-holiday periods. This finding confirms that a binary holiday indicator provides an incomplete representation of holiday-driven demand. Consequently, while holidays play an important role in explaining sales fluctuations, they cannot fully account for observed patterns in sales concentration and volatility. This motivates the use of multivariate regression models that incorporate time trends, macroeconomic variables, and store-level heterogeneity to disentangle overlapping effects.

5 STEP 4 — Regression Analysis: What Drives Weekly Sales?

Steps 2 and 3 demonstrated that sales and volatility are unevenly distributed across stores and that holiday effects are heterogeneous. However, descriptive analysis alone cannot isolate the relative importance of holidays, time trends, and macroeconomic conditions. Therefore, this step employs regression models to quantify the drivers of weekly sales while controlling for overlapping effects. The objective is not prediction accuracy, but explanatory clarity for managerial decision-making.

5.1 Model 1 — Baseline: Holiday Effect Only

Why this model?

We begin with the simplest possible specification to establish a benchmark. This model answers the question: If we only consider whether a week is a holiday, how much does that explain sales?

# ------------------------------------------------
# STEP 4.1 – Model 1: Holiday only
# ------------------------------------------------

model_1 <- lm(Weekly_Sales ~ Holiday_Flag, data = walmart)

summary(model_1)
## 
## Call:
## lm(formula = Weekly_Sales ~ Holiday_Flag, data = walmart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -907529 -493200  -85790  375499 2777430 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1041256       7291 142.820  < 2e-16 ***
## Holiday_Flag    81632      27570   2.961  0.00308 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 564000 on 6433 degrees of freedom
## Multiple R-squared:  0.001361,   Adjusted R-squared:  0.001206 
## F-statistic: 8.767 on 1 and 6433 DF,  p-value: 0.003079
  • Holiday coefficient ≈ +81,600

  • Statistically significant (p ≈ 0.003)

  • R² ≈ 0.0014

The baseline regression shows that holiday weeks are associated with an average increase of approximately 81,600 in weekly sales relative to non-holiday weeks, and this effect is statistically significant. However, the model’s explanatory power is extremely limited, with an R-squared value close to zero. This indicates that although holidays are associated with higher sales on average, they explain only a negligible fraction of the overall variation in weekly sales. Consequently, holiday effects alone are insufficient to understand Walmart’s sales dynamics.

5.2 Model 2 — Adding Time Structure (Trend & Seasonality)

Why add time?

Sales naturally evolve over time due to inflation, store maturity, and seasonality. Ignoring time effects risks attributing long-term trends to holidays.

# ------------------------------------------------
# STEP 4.2 – Model 2: Holiday + time effects
# ------------------------------------------------

model_2 <- lm(
  Weekly_Sales ~ Holiday_Flag + Year + Month,
  data = walmart
)

summary(model_2)
## 
## Call:
## lm(formula = Weekly_Sales ~ Holiday_Flag + Year + Month, data = walmart)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1096762  -480302   -77631   374868  2548039 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -18537153   18294914  -1.013 0.310983    
## Holiday_Flag     31735      30140   1.053 0.292425    
## Year              9739       9098   1.071 0.284412    
## Month.L         191021      27350   6.984 3.15e-12 ***
## Month.Q          90371      26610   3.396 0.000687 ***
## Month.C         180388      25567   7.056 1.90e-12 ***
## Month^4          50854      25432   2.000 0.045587 *  
## Month^5          41901      25670   1.632 0.102660    
## Month^6         -73224      25221  -2.903 0.003705 ** 
## Month^7         -13208      24409  -0.541 0.588468    
## Month^8         -29617      24152  -1.226 0.220132    
## Month^9           3463      23242   0.149 0.881552    
## Month^10          7162      23488   0.305 0.760452    
## Month^11         30501      23056   1.323 0.185918    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 559500 on 6421 degrees of freedom
## Multiple R-squared:  0.01915,    Adjusted R-squared:  0.01717 
## F-statistic: 9.646 on 13 and 6421 DF,  p-value: < 2.2e-16
  • Holiday_Flag → not meaningful statistically

  • Year → not meaningful

  • Month effects → strong

  • R² ≈ 0.019

After controlling for time effects, the estimated holiday coefficient becomes statistically insignificant, suggesting that part of the apparent holiday effect observed in the baseline model was capturing underlying seasonal patterns. Several month indicators are highly significant, highlighting strong seasonality in weekly sales. Although the overall explanatory power remains modest, the increase in R-squared relative to the baseline model demonstrates that temporal structure plays a substantially more important role than holidays alone.

5.3 Model 3 — Full Model: Holiday + Time + Macroeconomic Factors

Why macro variables?

Consumer spending is influenced by broader economic conditions. This model evaluates whether sales respond to CPI, unemployment, fuel prices, and weather, beyond holidays and time.

# ------------------------------------------------
# STEP 4.3 – Model 3: Full explanatory model
# ------------------------------------------------

model_3 <- lm(
  Weekly_Sales ~ Holiday_Flag + Year + Month +
    Temperature + Fuel_Price + CPI + Unemployment,
  data = walmart
)

summary(model_3)
## 
## Call:
## lm(formula = Weekly_Sales ~ Holiday_Flag + Year + Month + Temperature + 
##     Fuel_Price + CPI + Unemployment, data = walmart)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1176560  -469777  -103678   394114  2570012 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  93755105.5 35838224.8   2.616 0.008916 ** 
## Holiday_Flag    32950.6    29780.3   1.106 0.268571    
## Year           -45907.2    17866.1  -2.570 0.010206 *  
## Month.L        154044.2    30091.1   5.119 3.16e-07 ***
## Month.Q        128243.2    42972.5   2.984 0.002853 ** 
## Month.C        168301.5    27699.8   6.076 1.30e-09 ***
## Month^4         55092.7    26097.9   2.111 0.034811 *  
## Month^5         51758.6    25737.7   2.011 0.044367 *  
## Month^6        -83331.3    25249.5  -3.300 0.000971 ***
## Month^7        -13714.2    24165.5  -0.568 0.570386    
## Month^8        -25203.2    24010.7  -1.050 0.293911    
## Month^9          -200.7    22959.6  -0.009 0.993027    
## Month^10         4827.6    23203.5   0.208 0.835195    
## Month^11        33206.7    22796.1   1.457 0.145252    
## Temperature       312.9      693.0   0.452 0.651616    
## Fuel_Price      70304.0    30778.4   2.284 0.022392 *  
## CPI             -1529.3      211.2  -7.242 4.93e-13 ***
## Unemployment   -47530.8     4307.4 -11.035  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 552400 on 6417 degrees of freedom
## Multiple R-squared:  0.04444,    Adjusted R-squared:  0.04191 
## F-statistic: 17.56 on 17 and 6417 DF,  p-value: < 2.2e-16
  • Holiday_Flag → still not meaningful

  • Year → negative and meaningful

  • CPI → negative and very strong

  • Unemployment → negative and very strong

  • Fuel_Price → positive and meaningful

  • Temperature → not meaningful

  • R² ≈ 0.044

The full model provides the highest explanatory power among all specifications, indicating that weekly sales are influenced by a combination of temporal and macroeconomic factors. Once these variables are included, the holiday indicator remains statistically insignificant, confirming that holidays do not independently drive sales when broader economic conditions are taken into account. In contrast, unemployment and CPI exhibit strong negative relationships with sales, suggesting that macroeconomic pressure significantly dampens consumer spending. The positive coefficient on fuel prices may reflect correlated economic activity rather than a direct causal effect. Overall, these results indicate that macroeconomic conditions dominate holiday timing in explaining sales variation.

5.4 Comparing Models: What Really Adds Value?

# ------------------------------------------------
# STEP 4.4 – Model comparison
# ------------------------------------------------

model_comparison <- tibble(
  Model = c("Holiday only", "Holiday + Time", "Full Model"),
  R_squared = c(
    summary(model_1)$r.squared,
    summary(model_2)$r.squared,
    summary(model_3)$r.squared
  )
)

model_comparison

Actual R² values:

  • Holiday only → 0.001

  • Holiday + Time → 0.019

  • Full Model → 0.044

Model comparison reveals a clear hierarchy in explanatory power. The holiday-only model explains virtually none of the variation in weekly sales. Introducing time effects substantially improves model performance, while the inclusion of macroeconomic variables further doubles explanatory power. Despite this improvement, the relatively low R-squared values across all models indicate that a significant portion of sales variation remains store-specific or idiosyncratic, reinforcing the importance of store-level segmentation in subsequent analysis.

5.5 Step 4 Executive Summary

The regression analysis demonstrates that while holiday weeks are associated with higher sales in isolation, this effect disappears once time structure and macroeconomic conditions are accounted for. Seasonality and broader economic indicators, particularly unemployment and inflation, play a far more significant role in shaping weekly sales patterns. These findings suggest that managerial decisions focused solely on holiday timing may be misguided unless they are aligned with prevailing economic conditions. Furthermore, the limited explanatory power of aggregate models highlights the need to incorporate store-level heterogeneity, motivating a segmentation-based approach in the next step.

6 STEP 5 — Store Segmentation (K-Means Clustering)

Although regression models capture general patterns, they explain only a limited portion of weekly sales variation, suggesting that store-level differences play a major role. Therefore, we segment stores into meaningful groups using k-means clustering. The objective is to create actionable store segments (e.g., “high sales–stable” vs. “high volatility”) that can support differentiated managerial strategies for planning, resource allocation, and promotion design.

6.1 Create Store-Level Features for Clustering

What we include

We build a store-level dataset with features that capture:

  • Sales level (average weekly sales)

  • Sales volatility (CV or SD)

  • Holiday sensitivity (holiday lift at store level)

# ------------------------------------------------
# STEP 5.1 – Store-level dataset for clustering
# ------------------------------------------------

store_cluster_df <- walmart %>%
  group_by(Store) %>%
  summarise(
    avg_sales = mean(Weekly_Sales, na.rm = TRUE),
    sd_sales  = sd(Weekly_Sales, na.rm = TRUE),
    cv_sales  = sd_sales / avg_sales,
    
    # Holiday sensitivity: difference between holiday and non-holiday mean sales per store
    holiday_mean = mean(Weekly_Sales[Holiday_Flag == 1], na.rm = TRUE),
    nonholiday_mean = mean(Weekly_Sales[Holiday_Flag == 0], na.rm = TRUE),
    holiday_lift_pct = (holiday_mean - nonholiday_mean) / nonholiday_mean * 100,
    
    .groups = "drop"
  )

# Check for infinite / NA lift values (can happen if a store has no non-holiday weeks)
store_cluster_df <- store_cluster_df %>%
  mutate(
    holiday_lift_pct = ifelse(is.finite(holiday_lift_pct), holiday_lift_pct, NA_real_)
  )

summary(store_cluster_df)
##      Store      avg_sales          sd_sales         cv_sales      
##  Min.   : 1   Min.   : 259862   Min.   : 21837   Min.   :0.04208  
##  1st Qu.:12   1st Qu.: 556404   1st Qu.: 85770   1st Qu.:0.11502  
##  Median :23   Median : 966782   Median :128753   Median :0.13090  
##  Mean   :23   Mean   :1046965   Mean   :141586   Mean   :0.13262  
##  3rd Qu.:34   3rd Qu.:1395901   3rd Qu.:191723   3rd Qu.:0.15913  
##  Max.   :45   Max.   :2107677   Max.   :317570   Max.   :0.22968  
##   holiday_mean     nonholiday_mean   holiday_lift_pct
##  Min.   : 262595   Min.   : 259656   Min.   :-2.380  
##  1st Qu.: 606958   1st Qu.: 555555   1st Qu.: 5.319  
##  Median :1072047   Median : 961128   Median : 7.657  
##  Mean   :1122888   Mean   :1041256   Mean   : 7.398  
##  3rd Qu.:1500026   3rd Qu.:1388073   3rd Qu.: 9.893  
##  Max.   :2249035   Max.   :2097048   Max.   :19.439

This feature engineering step translates the weekly dataset into a store-level analytic view. In addition to sales level and volatility, we explicitly include holiday sensitivity as a behavioral feature. This ensures that segmentation is not based solely on store size, but also captures how stores respond to peak-demand periods.

6.2 Prepare Data for K-Means (Scaling)

K-means is distance-based, so variables with large scales dominate results. Therefore, we standardize variables before clustering.

# ------------------------------------------------
# STEP 5.2 – Scale variables for k-means
# ------------------------------------------------

cluster_features <- store_cluster_df %>%
  select(avg_sales, cv_sales, holiday_lift_pct) %>%
  drop_na()  # remove stores with missing holiday lift

cluster_scaled <- scale(cluster_features)

Scaling ensures that average sales, volatility, and holiday sensitivity contribute equally to the clustering solution. Without scaling, average sales would dominate the segmentation and reduce interpretability.

6.3 Choose the Number of Clusters (Elbow Method)

# ------------------------------------------------
# STEP 5.3 – Elbow method
# ------------------------------------------------

set.seed(123)

wss <- map_dbl(1:10, function(k) {
  kmeans(cluster_scaled, centers = k, nstart = 25)$tot.withinss
})

elbow_df <- tibble(k = 1:10, wss = wss)

ggplot(elbow_df, aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Elbow Method for Choosing k",
    x = "Number of clusters (k)",
    y = "Total within-cluster sum of squares"
  )

The elbow plot shows a sharp reduction in within-cluster variation up to three clusters, after which the marginal improvement becomes noticeably smaller. This clear “elbow” at k = 3 indicates that a three-segment solution provides an appropriate balance between explanatory power and interpretability. Therefore, k = 3 is selected for the final clustering analysis.

6.4 Run K-Means with Chosen k

# ------------------------------------------------
# STEP 5.4 – K-means clustering
# ------------------------------------------------

set.seed(123)
k <- 3

kmeans_model <- kmeans(cluster_scaled, centers = k, nstart = 25)

store_clustered <- store_cluster_df %>%
  drop_na(holiday_lift_pct) %>%
  mutate(Segment = factor(kmeans_model$cluster))

store_clustered %>% count(Segment)

6.5 Segment Profiling (What Differentiates Segments?)

# ------------------------------------------------
# STEP 5.5 – Segment summary table
# ------------------------------------------------

segment_profile <- store_clustered %>%
  group_by(Segment) %>%
  summarise(
    n_stores = n(),
    avg_sales_mean = mean(avg_sales),
    cv_sales_mean = mean(cv_sales),
    holiday_lift_mean = mean(holiday_lift_pct),
    .groups = "drop"
  )

segment_profile
Segment Interpretation
Segment 1 (8 stores) Low sales – very stable – negative holiday lift
Segment 2 (14 stores) Mid sales – highly volatile – strongly holiday-driven
Segment 3 (23 stores) High sales – moderately volatile – positive holiday response

The segmentation results reveal three clearly differentiated store types. Segment 1 consists of low-sales but highly stable stores, with a slightly negative average holiday lift, indicating limited responsiveness to holiday periods. Segment 2 is characterized by moderate sales levels combined with the highest relative volatility and the strongest positive holiday response, suggesting that demand in these stores is both event-driven and unstable. Segment 3 includes high-sales stores with moderate volatility and a consistently positive holiday lift, positioning them as core revenue drivers with predictable holiday benefits.

6.6 Visualize Segments

Plot A: Sales vs Volatility by Segment

ggplot(store_clustered, aes(x = avg_sales, y = cv_sales, shape = Segment)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Store Segments: Average Sales vs Relative Volatility",
    x = "Average Weekly Sales",
    y = "Coefficient of Variation (CV)"
  )

The sales–volatility scatter plot highlights clear structural differences across segments. Segment 1 stores cluster at lower sales levels with minimal volatility, indicating predictable but limited revenue contribution. Segment 2 displays the highest dispersion in volatility for similar sales levels, reinforcing their classification as unstable and demand-sensitive locations. Segment 3 stores dominate the higher sales range while maintaining moderate volatility, confirming their role as Walmart’s primary revenue engines.

Plot B: Holiday Sensitivity by Segment

ggplot(store_clustered, aes(x = Segment, y = holiday_lift_pct)) +
  geom_boxplot() +
  labs(
    title = "Holiday Lift Distribution by Store Segment",
    x = "Segment",
    y = "Holiday Lift (%)"
  )

The holiday lift distribution further differentiates store segments. Segment 1 shows a slightly negative holiday effect, suggesting that holiday periods do not generate incremental demand in these locations. In contrast, Segment 2 exhibits the highest and most dispersed holiday lift, indicating strong but unpredictable responses to holiday events. Segment 3 demonstrates a positive and more stable holiday lift, implying that holiday promotions reliably enhance sales in high-performing stores.

6.7 Step 5 Executive Summary

The k-means clustering results suggest that Walmart stores can be grouped into distinct, interpretable segments based on sales level, sales stability, and holiday sensitivity. This segmentation enables differentiated strategies: high-sales stable stores may benefit from operational optimization and inventory prioritization, volatility-driven stores require demand smoothing and risk management, and holiday-responsive stores can be targeted with event-based promotions. Overall, segmentation offers a practical bridge between analytics and managerial action.

7 Final Managerial Recommendations

7.1 Overview

The analysis reveals that Walmart’s sales performance and volatility are not driven by a single universal factor such as holidays or macroeconomic conditions. Instead, store-level heterogeneity plays a dominant role. Based on the k-means segmentation results, three distinct store segments emerge, each requiring a differentiated managerial approach. The recommendations below translate the analytical findings into concrete, actionable strategies.


7.2 Segment 1: Stable Low-Impact Stores

Profile - Low average weekly sales
- Very low relative volatility
- Slightly negative or negligible holiday lift

Managerial Implications

These stores exhibit predictable demand patterns but contribute limited incremental revenue, particularly during holiday periods. Aggressive promotional activity in these locations is unlikely to yield meaningful returns.

Recommended Actions - Avoid heavy holiday-specific promotions
- Focus on cost efficiency and operational stability
- Use these stores as baseline demand benchmarks for

8 Conclusion, Limitations, and Future Research

8.1 Conclusion

This study addressed a focused set of research questions aimed at understanding Walmart’s sales dynamics from both an analytical and managerial perspective.

Regarding RQ1, the analysis showed that both sales and sales volatility are highly concentrated across stores, but not perfectly aligned. A relatively small number of stores account for a disproportionate share of total sales, while a partially different subset drives overall volatility. This finding confirms that aggregate performance metrics mask important store-level differences.

With respect to RQ2, the holiday analysis demonstrated that holiday periods are associated with higher average sales and increased dispersion. However, the effect of holidays is highly heterogeneous: only a subset of holiday weeks generates extreme sales spikes, while others resemble regular periods. This confirms that a simple binary holiday indicator provides an incomplete representation of holiday-driven demand.

In answering RQ3, regression results revealed that while holidays appear significant in isolation, their effect disappears once time structure and macroeconomic conditions are controlled for. Seasonality and broader economic indicators—particularly unemployment and inflation—play a substantially more important role in explaining weekly sales variation. These findings suggest that holiday timing alone is insufficient for understanding or predicting sales performance.

Finally, RQ4 demonstrated that store segmentation offers a practical solution to the limitations of aggregate modeling. Using k-means clustering, stores were grouped into three distinct segments based on sales level, volatility, and holiday sensitivity. These segments exhibit clearly differentiated behavioral patterns, enabling targeted managerial strategies rather than uniform, chain-wide policies.

Overall, the results highlight the importance of combining descriptive analysis, explanatory modeling, and segmentation to translate complex sales data into actionable insights. By aligning analytical methods with well-defined research questions, the study provides a structured framework for data-driven decision-making in retail marketing analytics.


8.2 Limitations

Despite its insights, this analysis is subject to several limitations. First, the dataset does not distinguish between different types of holidays (e.g., major shopping events versus routine public holidays). As a result, holiday effects are captured in an aggregated manner, which may mask important differences in demand intensity across specific events. Second, the regression models rely on aggregate weekly sales and do not incorporate product-level or customer-level information, limiting the ability to analyze category-specific or behavioral drivers of demand.

Additionally, while macroeconomic variables improve explanatory power, the relatively low R-squared values suggest that unobserved factors—such as local competition, store-specific promotions, or regional demographics—may play a significant role. Finally, the clustering results depend on the selected variables and number of clusters; alternative specifications could yield slightly different segment structures.


8.3 Future Research Directions

Future research could extend this analysis in several ways. First, incorporating more granular holiday classifications (e.g., separating major promotional events from minor holidays) would allow for a more precise estimation of event-specific effects. Second, integrating store-level promotional data, pricing information, or product-category sales could improve explanatory power and enable deeper behavioral insights.

Moreover, advanced modeling techniques such as mixed-effects models or panel data approaches could explicitly account for store-specific fixed effects and dynamic responses over time. Finally, linking sales performance to customer or regional demographic data would support more personalized and location-specific marketing strategies, further enhancing the managerial relevance of the analysis.

---
title: "MiM811 Mini Project 2 – Walmart Sales Analytics"
author: "Group 4"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output:
  html_document:
    theme: flatly
    highlight: pygments
    toc: true
    toc_depth: 3
    toc_float:
      collapsed: true
      smooth_scroll: true
    number_sections: true
    code_folding: show
    code_download: true
    df_print: paged
    self_contained: true
    mathjax: default
---

```{r setup, include=FALSE}
# ------------------------------------------------
# STEP 0 – Global Setup (runs automatically)
# ------------------------------------------------

# Show code in report (important for grading)
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE
)

# Load required libraries
library(tidyverse)   # data manipulation + visualization
library(readxl)      # reading Excel files
library(lubridate)   # date handling
library(scales)      # formatting axes (percent, comma)
library(broom)       # tidy regression outputs
```

**Project Background and Objective**

Walmart’s senior management suspects that overall sales performance is driven by a limited number of stores, while sales volatility may originate from a different subset of locations. At the same time, the CEO is uncertain whether observed weekly fluctuations can be systematically explained by holiday effects and macroeconomic conditions, or whether they reflect random noise.

The objective of this project is to translate these managerial concerns into data-driven insights using exploratory analysis, regression modeling, and clustering techniques. The ultimate goal is to provide actionable recommendations rather than purely statistical results.

# Research Questions

The dataset and project brief provide a wide range of potential analytical questions related to sales performance, volatility, holidays, and macroeconomic conditions. To maintain analytical focus and managerial relevance, this study deliberately narrows the scope to a small set of interconnected research questions. These questions are selected based on 
(i) the CEO’s stated concerns, 
(ii) feasibility given the available data, and 
(iii) their ability to support actionable insights rather than purely descriptive findings.

The final research questions guiding this analysis are as follows:

**RQ1: Sales Concentration and Volatility**  
Which stores contribute disproportionately to Walmart’s total sales, and which stores contribute disproportionately to sales volatility?

**RQ2: Holiday Effects Beyond a Binary Indicator**  
How do holiday periods affect weekly sales levels and volatility, and to what extent does a simple holiday flag capture the true heterogeneity of holiday-driven demand?

**RQ3: Drivers of Weekly Sales Performance**  
To what extent can weekly sales variation be explained by holidays, time structure (trend and seasonality), and macroeconomic conditions?

**RQ4: Store Segmentation for Managerial Action**  
Can Walmart stores be segmented into meaningful groups based on sales level, volatility, and holiday sensitivity, and how can these segments inform differentiated managerial strategies?

Together, these research questions structure the analysis from descriptive diagnosis (RQ1), to causal explanation (RQ2–RQ3), and finally to prescriptive insight (RQ4). This progression ensures that the analysis remains both analytically rigorous and managerially relevant.



# STEP 1 — Data Import & Data Understanding

## Data Import

> This step ensures that the dataset is successfully loaded and that the column names and first observations match expectations.

```{r readExcel}
# ------------------------------------------------
# STEP 1.1 – Import Walmart Dataset
# ------------------------------------------------

# The Excel file should be uploaded to Posit Cloud
# and located in the same folder as this RMD file

walmart_raw <- read_excel("WalmartData.xlsx", sheet = "in")

# First look at the data
head(walmart_raw)
```

## Data Structure & Variable Types

```{r step1.2}
walmart <- walmart_raw %>%
  mutate(
    # Convert these columns explicitly to numeric
    Weekly_Sales = as.numeric(gsub(",", ".", Weekly_Sales)),
    Temperature  = as.numeric(gsub(",", ".", Temperature)),
    Fuel_Price   = as.numeric(gsub(",", ".", Fuel_Price)),
    CPI          = as.numeric(gsub(",", ".", CPI)),
    Unemployment = as.numeric(gsub(",", ".", Unemployment)),
    Holiday_Flag = as.integer(Holiday_Flag),
    Store        = as.integer(Store)
  )

glimpse(walmart)
```

## Descriptive Statistics

```{r step1.3}

summary(walmart_raw)
```

>As a result of the initial data audit, the dataset requires preprocessing to ensure correct variable types. This step is necessary to produce valid descriptive insights and reliable statistical models.

## Missing Values Check

The missing values analysis indicates that there are no missing observations across all variables in the dataset. Each column, including sales, holiday indicators, and macroeconomic variables, contains complete information for all time periods and stores. This is a strong advantage for the analysis, as it eliminates the need for imputation or row removal strategies that could otherwise introduce bias or reduce statistical power. Consequently, subsequent exploratory and modeling steps can be conducted on the full dataset without concerns related to data incompleteness.

```{r step1.4}

# ------------------------------------------------
# STEP 1.4 – Missing values
# ------------------------------------------------

missing_summary <- walmart %>%
  summarise(across(everything(), ~ sum(is.na(.))))

missing_summary

```

## Date Conversion & Feature Engineering

Time effects (trend, seasonality, holidays) cannot be analyzed unless dates are correctly formatted.

```{r step1.5}
# ------------------------------------------------
# STEP 1.5 – Date handling and time variables
# ------------------------------------------------

walmart <- walmart %>%
  mutate(
    Date  = dmy(Date),              # convert text to Date
    Year  = year(Date),
    Month = month(Date, label = TRUE)
  )

summary(walmart$Date)
```


>After the data type conversion, all key variables are correctly formatted. Weekly sales and macroeconomic variables are stored as numeric values, the store identifier is an integer, and the date variable has been successfully parsed. This confirms that the dataset is now suitable for aggregation, visualization, and statistical modeling.


# STEP 2 — Exploratory Analysis: Store Concentration & Volatility

**Purpose of Step 2**
To empirically test the CEO’s intuition that:

- a small number of stores drive total sales

- a different subset drives sales volatility

This step is descriptive but strategic — it defines where to look next.

## Store-Level Sales and Volatility Metrics

```{r step2.1}
# ------------------------------------------------
# STEP 2.1 – Store-level performance metrics
# ------------------------------------------------

# ------------------------------------------------
# Fix: Convert numeric-looking text columns to numeric
# ------------------------------------------------




store_summary <- walmart %>%
  group_by(Store) %>%
  summarise(
    avg_weekly_sales = mean(Weekly_Sales, na.rm = TRUE),
    total_sales      = sum(Weekly_Sales, na.rm = TRUE),
    sd_sales         = sd(Weekly_Sales, na.rm = TRUE),
    cv_sales         = sd_sales / avg_weekly_sales,  # relative volatility
    .groups = "drop"
  ) %>%
  mutate(
    sales_share = total_sales / sum(total_sales),
    volatility_share = sd_sales / sum(sd_sales)
  )

# Inspect top stores
store_summary %>% arrange(desc(total_sales)) %>% head(10)
```


>The store-level summary statistics reveal substantial heterogeneity in performance across Walmart locations. The top-selling stores generate average weekly sales exceeding **2 million**, while their coefficients of variation range between approximately **0.10 and 0.16**. This indicates that even among the highest-revenue stores, relative sales stability differs meaningfully. Some large stores combine high sales with relatively stable demand, whereas others experience considerable week-to-week fluctuations despite similar revenue levels.


## Sales Concentration: Do a Few Stores Dominate Revenue?

```{r step2.2}
# ------------------------------------------------
# STEP 2.2 – Sales concentration plot
# ------------------------------------------------

top_sales <- store_summary %>%
  arrange(desc(total_sales)) %>%
  mutate(rank = row_number())

ggplot(top_sales, aes(x = rank, y = sales_share)) +
  geom_col() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Store Ranking by Share of Total Sales",
    x = "Store Rank (1 = Highest Total Sales)",
    y = "Share of Total Sales"
  )
```

>The sales concentration analysis shows a pronounced skew in revenue contribution across stores. The highest-ranked stores each account for roughly 4–4.5% of total sales, while the sales share declines steadily across lower-ranked locations. This pattern confirms that Walmart’s overall revenue is not evenly distributed, but instead depends heavily on a relatively small group of high-performing stores. Consequently, changes in performance at these top locations are likely to have a disproportionate impact on total company sales.


## Volatility Concentration: Who Drives Weekly Instability?

```{r step2.3}
# ------------------------------------------------
# STEP 2.3 – Volatility concentration plot
# ------------------------------------------------

top_volatility <- store_summary %>%
  arrange(desc(sd_sales)) %>%
  mutate(rank = row_number())

ggplot(top_volatility, aes(x = rank, y = volatility_share)) +
  geom_col() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Store Ranking by Share of Sales Volatility",
    x = "Store Rank (1 = Most Volatile)",
    y = "Share of Total Volatility"
  )
```

>A similar but distinct concentration pattern emerges when examining sales volatility. A limited number of stores contribute a disproportionately large share of total sales volatility, with the most volatile stores accounting for nearly 5% each. Importantly, volatility declines gradually across the ranking, indicating that instability is concentrated rather than widespread. This suggests that Walmart’s week-to-week sales fluctuations are primarily driven by specific locations rather than by uniform, chain-wide shocks.


## Are High-Sales Stores Also High-Volatility Stores?

```{r step2.4}
# ------------------------------------------------
# STEP 2.4 – Sales rank vs volatility rank
# ------------------------------------------------

store_ranks <- store_summary %>%
  mutate(
    sales_rank = rank(-total_sales, ties.method = "first"),
    vol_rank   = rank(-sd_sales, ties.method = "first")
  )

ggplot(store_ranks, aes(x = sales_rank, y = vol_rank)) +
  geom_point() +
  labs(
    title = "Relationship Between Sales Rank and Volatility Rank",
    x = "Sales Rank (1 = Highest Sales)",
    y = "Volatility Rank (1 = Highest Volatility)"
  )
```

>The comparison between sales rank and volatility rank indicates that high-revenue stores are not systematically the most volatile. While some high-sales locations also exhibit high volatility, many top-selling stores display moderate or low volatility rankings. Conversely, several lower-ranked sales stores appear among the most volatile. This dispersion suggests the presence of distinct store roles within the network, separating revenue-driving locations from volatility-driving ones.


## Step 2 Summary 

>Overall, the exploratory analysis confirms that Walmart’s sales and volatility are both highly concentrated but not perfectly aligned across stores. A small number of locations drive a large share of total revenue, while a partially different subset drives sales instability. These findings validate the CEO’s initial intuition and demonstrate that aggregate metrics mask important store-level dynamics. As a result, subsequent analyses should focus on identifying the drivers of volatility and sales at a more granular level, particularly examining the role of holidays and temporal effects.


# STEP 3 — Holiday Analysis

>In Step 2, we showed that sales and volatility are concentrated in specific stores. However, store-level patterns alone do not explain why weekly sales fluctuate. One frequently cited explanation is the effect of holidays. While the dataset provides a binary holiday indicator (Holiday_Flag), relying solely on a 0/1 variable may oversimplify the underlying dynamics. Therefore, this step examines holiday effects in a structured but focused manner, without losing sight of the broader analytical objective.


## Holiday vs Non-Holiday: Average Sales Comparison

**What we do**

We first compare average weekly sales during holiday weeks versus non-holiday weeks.
This provides an intuitive benchmark for understanding whether holidays matter at all.

```{r step3.1}
# ------------------------------------------------
# STEP 3.1 – Average sales: Holiday vs Non-Holiday
# ------------------------------------------------

holiday_sales_summary <- walmart %>%
  group_by(Holiday_Flag) %>%
  summarise(
    avg_weekly_sales = mean(Weekly_Sales, na.rm = TRUE),
    sd_weekly_sales  = sd(Weekly_Sales, na.rm = TRUE),
    n_weeks = n(),
    .groups = "drop"
  )

holiday_sales_summary

```
>The comparison of average weekly sales indicates that holiday weeks are associated with a different sales level than non-holiday weeks. The difference in mean sales suggests that holidays are not neutral periods and may systematically shift consumer purchasing behavior. However, this aggregate comparison does not yet indicate whether the observed difference is large enough to be economically meaningful or consistent across stores. “Holiday effect var gibi görünüyor” ama tek başına yeterli değil.

## Visual Comparison: Sales Distribution

**Why visualization matters**

Averages can hide variability. Therefore, we compare the distribution of weekly sales during holiday and non-holiday weeks.

```{r step3.2}
# ------------------------------------------------
# STEP 3.2 – Sales distribution by holiday status
# ------------------------------------------------

ggplot(walmart, aes(x = factor(Holiday_Flag), y = Weekly_Sales)) +
  geom_boxplot() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Weekly Sales Distribution: Holiday vs Non-Holiday Weeks",
    x = "Holiday Flag (0 = Non-Holiday, 1 = Holiday)",
    y = "Weekly Sales"
  )

```

>The boxplot comparison shows that holiday weeks tend to shift the sales distribution upward relative to non-holiday weeks. At the same time, the wider spread during holiday periods indicates increased dispersion, suggesting that holidays may amplify not only sales levels but also sales uncertainty. This observation reinforces the need to examine volatility effects in addition to average sales differences.

## Holiday Lift: How Large Is the Effect?

**What is “holiday lift”?**

Holiday lift measures the percentage difference in average sales between holiday and non-holiday weeks.

```{r step3.3}
# ------------------------------------------------
# STEP 3.3 – Holiday lift calculation
# ------------------------------------------------

holiday_lift <- holiday_sales_summary %>%
  summarise(
    holiday_lift_pct =
      (avg_weekly_sales[Holiday_Flag == 1] -
       avg_weekly_sales[Holiday_Flag == 0]) /
       avg_weekly_sales[Holiday_Flag == 0] * 100
  )

holiday_lift

```

>The holiday lift calculation quantifies the magnitude of the holiday effect in percentage terms. This metric provides a more interpretable measure for managerial decision-making, as it translates raw sales differences into relative performance gains. A modest lift suggests that holidays play a secondary role, whereas a large lift would indicate that holiday-driven demand is a major contributor to overall sales performance.
Holiday’i sadece “var/yok” demiyoruz → etki büyüklüğünü ölçüyoruz.

## Holiday and Volatility

**Why this matters**

The CEO’s concern is not only about sales levels, but also about instability.
We therefore compare volatility across holiday and non-holiday weeks.

```{r step3.4}
# ------------------------------------------------
# STEP 3.4 – Volatility comparison
# ------------------------------------------------

holiday_volatility <- walmart %>%
  group_by(Holiday_Flag) %>%
  summarise(
    sd_weekly_sales = sd(Weekly_Sales, na.rm = TRUE),
    .groups = "drop"
  )

holiday_volatility

```

>The volatility comparison reveals that holiday weeks are associated with a different level of sales variability compared to non-holiday periods. This suggests that holidays may contribute not only to higher average sales, but also to increased uncertainty in weekly performance. Such a pattern is consistent with demand spikes that are unevenly distributed across stores and product categories.

## Zoom-in: Major Holiday Weeks (Black Friday–style Insight)

While the binary holiday flag captures whether a week is classified as a holiday, it does not distinguish between routine holidays and peak demand events such as major shopping periods. To better understand this limitation, we examine weekly sales patterns around the most prominent holiday periods.

```{r step3.5}
ggplot(walmart, aes(x = Date, y = Weekly_Sales)) +
  geom_line(alpha = 0.3) +
  geom_point(
    data = walmart %>% filter(Holiday_Flag == 1),
    aes(x = Date, y = Weekly_Sales),
    color = "red",
    alpha = 0.6
  ) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Weekly Sales Over Time with Holiday Weeks Highlighted",
    x = "Date",
    y = "Weekly Sales"
  )


```

>The time-series visualization reveals that holiday weeks are associated with noticeable sales spikes, but these spikes are **not uniform across all holiday periods.** While some holiday weeks correspond to exceptionally high sales levels, others remain close to the baseline observed during regular weeks. This heterogeneity indicates that the binary holiday indicator aggregates events with substantially different demand intensities. As a result, treating all holiday weeks as economically equivalent may obscure important variation in sales responses.

```{r step}
ggplot(
  walmart %>% filter(Holiday_Flag == 1),
  aes(x = Weekly_Sales)
) +
  geom_histogram(bins = 30, fill = "grey", color = "white") +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Distribution of Weekly Sales During Holiday Weeks",
    x = "Weekly Sales",
    y = "Frequency"
  )

```

>The distribution of weekly sales during holiday weeks is highly dispersed and right-skewed, with a long upper tail extending toward very high sales values. This pattern suggests that a subset of holiday periods generates disproportionately large sales outcomes, while many holiday weeks exhibit more moderate performance. Such dispersion reinforces the limitation of a simple holiday flag and highlights the need for models that can distinguish between routine holidays and peak demand events.

## Step 3 Summary: What Holidays Explain — and What They Don’t

The extended holiday analysis demonstrates that holiday periods are generally associated with higher average sales and increased variability in weekly performance. However, the time-series and distributional evidence shows that holiday effects are highly heterogeneous: only a subset of holiday weeks generates extreme sales spikes, while others resemble non-holiday periods. This finding confirms that a binary holiday indicator provides an incomplete representation of holiday-driven demand. Consequently, while holidays play an important role in explaining sales fluctuations, they cannot fully account for observed patterns in sales concentration and volatility. This motivates the use of multivariate regression models that incorporate time trends, macroeconomic variables, and store-level heterogeneity to disentangle overlapping effects.

# STEP 4 — Regression Analysis: What Drives Weekly Sales?

**Steps 2 and 3** demonstrated that sales and volatility are unevenly distributed across stores and that holiday effects are heterogeneous. However, descriptive analysis alone cannot isolate the relative importance of holidays, time trends, and macroeconomic conditions. Therefore, this step employs regression models to quantify the drivers of weekly sales while controlling for overlapping effects. The objective is not prediction accuracy, but **explanatory clarity** for managerial decision-making.

## Model 1 — Baseline: Holiday Effect Only

**Why this model?**

We begin with the simplest possible specification to establish a benchmark.
This model answers the question: If we only consider whether a week is a holiday, how much does that explain sales?

```{r step4.1}
# ------------------------------------------------
# STEP 4.1 – Model 1: Holiday only
# ------------------------------------------------

model_1 <- lm(Weekly_Sales ~ Holiday_Flag, data = walmart)

summary(model_1)

```


- Holiday coefficient ≈ **+81,600**

- Statistically significant **(p ≈ 0.003)**

- R² ≈ **0.0014**

>The baseline regression shows that holiday weeks are associated with an average increase of approximately 81,600 in weekly sales relative to non-holiday weeks, and this effect is statistically significant. However, the model’s explanatory power is extremely limited, with an R-squared value close to zero. This indicates that although holidays are associated with higher sales on average, they explain only a negligible fraction of the overall variation in weekly sales. Consequently, holiday effects alone are insufficient to understand Walmart’s sales dynamics.

## Model 2 — Adding Time Structure (Trend & Seasonality)

**Why add time?**

Sales naturally evolve over time due to inflation, store maturity, and seasonality.
Ignoring time effects risks attributing long-term trends to holidays.

```{r step4.2}
# ------------------------------------------------
# STEP 4.2 – Model 2: Holiday + time effects
# ------------------------------------------------

model_2 <- lm(
  Weekly_Sales ~ Holiday_Flag + Year + Month,
  data = walmart
)

summary(model_2)

```

- Holiday_Flag → not meaningful statistically

- Year → not meaningful

- Month effects → strong

- R² ≈ 0.019

>After controlling for time effects, the estimated holiday coefficient becomes statistically insignificant, suggesting that part of the apparent holiday effect observed in the baseline model was capturing underlying seasonal patterns. Several month indicators are highly significant, highlighting strong seasonality in weekly sales. Although the overall explanatory power remains modest, the increase in R-squared relative to the baseline model demonstrates that temporal structure plays a substantially more important role than holidays alone.

## Model 3 — Full Model: Holiday + Time + Macroeconomic Factors

**Why macro variables?**

Consumer spending is influenced by broader economic conditions.
This model evaluates whether sales respond to CPI, unemployment, fuel prices, and weather, beyond holidays and time.

```{r step4.3}
# ------------------------------------------------
# STEP 4.3 – Model 3: Full explanatory model
# ------------------------------------------------

model_3 <- lm(
  Weekly_Sales ~ Holiday_Flag + Year + Month +
    Temperature + Fuel_Price + CPI + Unemployment,
  data = walmart
)

summary(model_3)

```

- Holiday_Flag → still not meaningful

- Year → negative and meaningful

- CPI → negative and very strong

- Unemployment → negative and very strong

- Fuel_Price → positive and meaningful

- Temperature → not meaningful

- R² ≈ 0.044


>The full model provides the highest explanatory power among all specifications, indicating that weekly sales are influenced by a combination of temporal and macroeconomic factors. Once these variables are included, the holiday indicator remains statistically insignificant, confirming that holidays do not independently drive sales when broader economic conditions are taken into account. In contrast, unemployment and CPI exhibit strong negative relationships with sales, suggesting that macroeconomic pressure significantly dampens consumer spending. The positive coefficient on fuel prices may reflect correlated economic activity rather than a direct causal effect. Overall, these results indicate that macroeconomic conditions dominate holiday timing in explaining sales variation.

## Comparing Models: What Really Adds Value?

```{r step4.4}
# ------------------------------------------------
# STEP 4.4 – Model comparison
# ------------------------------------------------

model_comparison <- tibble(
  Model = c("Holiday only", "Holiday + Time", "Full Model"),
  R_squared = c(
    summary(model_1)$r.squared,
    summary(model_2)$r.squared,
    summary(model_3)$r.squared
  )
)

model_comparison


```

**Actual R² values:**

- Holiday only → **0.001**

- Holiday + Time → **0.019**

- Full Model → **0.044**


>Model comparison reveals a clear hierarchy in explanatory power. The holiday-only model explains virtually none of the variation in weekly sales. Introducing time effects substantially improves model performance, while the inclusion of macroeconomic variables further doubles explanatory power. Despite this improvement, the relatively low R-squared values across all models indicate that a significant portion of sales variation remains store-specific or idiosyncratic, reinforcing the importance of store-level segmentation in subsequent analysis.


## Step 4 Executive Summary

>The regression analysis demonstrates that while holiday weeks are associated with higher sales in isolation, this effect disappears once time structure and macroeconomic conditions are accounted for. Seasonality and broader economic indicators, particularly unemployment and inflation, play a far more significant role in shaping weekly sales patterns. These findings suggest that managerial decisions focused solely on holiday timing may be misguided unless they are aligned with prevailing economic conditions. Furthermore, the limited explanatory power of aggregate models highlights the need to incorporate store-level heterogeneity, motivating a segmentation-based approach in the next step.

# STEP 5 — Store Segmentation (K-Means Clustering) 

Although regression models capture general patterns, they explain only a limited portion of weekly sales variation, suggesting that store-level differences play a major role. Therefore, we segment stores into meaningful groups using k-means clustering. The objective is to create actionable store segments (e.g., “high sales–stable” vs. “high volatility”) that can support differentiated managerial strategies for planning, resource allocation, and promotion design.

## Create Store-Level Features for Clustering

**What we include**

We build a store-level dataset with features that capture:

- **Sales level** (average weekly sales)

- **Sales volatility** (CV or SD)

- **Holiday sensitivity** (holiday lift at store level)

```{r step5.1}
# ------------------------------------------------
# STEP 5.1 – Store-level dataset for clustering
# ------------------------------------------------

store_cluster_df <- walmart %>%
  group_by(Store) %>%
  summarise(
    avg_sales = mean(Weekly_Sales, na.rm = TRUE),
    sd_sales  = sd(Weekly_Sales, na.rm = TRUE),
    cv_sales  = sd_sales / avg_sales,
    
    # Holiday sensitivity: difference between holiday and non-holiday mean sales per store
    holiday_mean = mean(Weekly_Sales[Holiday_Flag == 1], na.rm = TRUE),
    nonholiday_mean = mean(Weekly_Sales[Holiday_Flag == 0], na.rm = TRUE),
    holiday_lift_pct = (holiday_mean - nonholiday_mean) / nonholiday_mean * 100,
    
    .groups = "drop"
  )

# Check for infinite / NA lift values (can happen if a store has no non-holiday weeks)
store_cluster_df <- store_cluster_df %>%
  mutate(
    holiday_lift_pct = ifelse(is.finite(holiday_lift_pct), holiday_lift_pct, NA_real_)
  )

summary(store_cluster_df)

```

>This feature engineering step translates the weekly dataset into a store-level analytic view. In addition to sales level and volatility, we explicitly include holiday sensitivity as a behavioral feature. This ensures that segmentation is not based solely on store size, but also captures how stores respond to peak-demand periods.

## Prepare Data for K-Means (Scaling)

K-means is distance-based, so variables with large scales dominate results.
Therefore, we standardize variables before clustering.

```{r step5.2}
# ------------------------------------------------
# STEP 5.2 – Scale variables for k-means
# ------------------------------------------------

cluster_features <- store_cluster_df %>%
  select(avg_sales, cv_sales, holiday_lift_pct) %>%
  drop_na()  # remove stores with missing holiday lift

cluster_scaled <- scale(cluster_features)

```

>Scaling ensures that average sales, volatility, and holiday sensitivity contribute equally to the clustering solution. Without scaling, average sales would dominate the segmentation and reduce interpretability.

## Choose the Number of Clusters (Elbow Method)

```{r step5.3}
# ------------------------------------------------
# STEP 5.3 – Elbow method
# ------------------------------------------------

set.seed(123)

wss <- map_dbl(1:10, function(k) {
  kmeans(cluster_scaled, centers = k, nstart = 25)$tot.withinss
})

elbow_df <- tibble(k = 1:10, wss = wss)

ggplot(elbow_df, aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Elbow Method for Choosing k",
    x = "Number of clusters (k)",
    y = "Total within-cluster sum of squares"
  )

```

>The elbow plot shows a **sharp reduction** in within-cluster variation up to three clusters, after which the marginal improvement becomes noticeably smaller. This clear “elbow” at **k = 3** indicates that a three-segment solution provides an appropriate balance between explanatory power and interpretability. Therefore, k = 3 is selected for the final clustering analysis.

## Run K-Means with Chosen k


```{r step5.4}
# ------------------------------------------------
# STEP 5.4 – K-means clustering
# ------------------------------------------------

set.seed(123)
k <- 3

kmeans_model <- kmeans(cluster_scaled, centers = k, nstart = 25)

store_clustered <- store_cluster_df %>%
  drop_na(holiday_lift_pct) %>%
  mutate(Segment = factor(kmeans_model$cluster))

store_clustered %>% count(Segment)

```

## Segment Profiling (What Differentiates Segments?)

```{r step5.5}
# ------------------------------------------------
# STEP 5.5 – Segment summary table
# ------------------------------------------------

segment_profile <- store_clustered %>%
  group_by(Segment) %>%
  summarise(
    n_stores = n(),
    avg_sales_mean = mean(avg_sales),
    cv_sales_mean = mean(cv_sales),
    holiday_lift_mean = mean(holiday_lift_pct),
    .groups = "drop"
  )

segment_profile
```

| Segment                   | Interpretation                                               |
| ------------------------- | ------------------------------------------------------------ |
| **Segment 1 (8 stores)**  | Low sales – very stable – negative holiday lift              |
| **Segment 2 (14 stores)** | Mid sales – highly volatile – strongly holiday-driven        |
| **Segment 3 (23 stores)** | High sales – moderately volatile – positive holiday response |


>The segmentation results reveal three clearly differentiated store types. 
**Segment 1** consists of low-sales but highly stable stores, with a slightly negative average holiday lift, indicating limited responsiveness to holiday periods. 
**Segment 2** is characterized by moderate sales levels combined with the highest relative volatility and the strongest positive holiday response, suggesting that demand in these stores is both event-driven and unstable. 
**Segment 3** includes high-sales stores with moderate volatility and a consistently positive holiday lift, positioning them as core revenue drivers with predictable holiday benefits.


## Visualize Segments 

**Plot A: Sales vs Volatility by Segment**

```{r step5.6}
ggplot(store_clustered, aes(x = avg_sales, y = cv_sales, shape = Segment)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Store Segments: Average Sales vs Relative Volatility",
    x = "Average Weekly Sales",
    y = "Coefficient of Variation (CV)"
  )
```

>The sales–volatility scatter plot highlights clear structural differences across segments.
**Segment 1** stores cluster at lower sales levels with minimal volatility, indicating predictable but limited revenue contribution. 
**Segment 2** displays the highest dispersion in volatility for similar sales levels, reinforcing their classification as unstable and demand-sensitive locations. 
**Segment 3** stores dominate the higher sales range while maintaining moderate volatility, confirming their role as Walmart’s primary revenue engines.

**Plot B: Holiday Sensitivity by Segment**

```{r step5.6.b}
ggplot(store_clustered, aes(x = Segment, y = holiday_lift_pct)) +
  geom_boxplot() +
  labs(
    title = "Holiday Lift Distribution by Store Segment",
    x = "Segment",
    y = "Holiday Lift (%)"
  )

```

>The holiday lift distribution further differentiates store segments. 
**Segment 1** shows a slightly negative holiday effect, suggesting that holiday periods do not generate incremental demand in these locations. 
**In contrast, Segment 2** exhibits the highest and most dispersed holiday lift, indicating strong but unpredictable responses to holiday events. 
**Segment 3** demonstrates a positive and more stable holiday lift, implying that holiday promotions reliably enhance sales in high-performing stores.

## Step 5 Executive Summary

>The k-means clustering results suggest that Walmart stores can be grouped into distinct, interpretable segments based on sales level, sales stability, and holiday sensitivity. This segmentation enables **differentiated** strategies: **high-sales stable stores** may benefit from operational optimization and inventory prioritization, **volatility-driven stores** require demand smoothing and risk management, and **holiday-responsive stores** can be targeted with event-based promotions. 
Overall, segmentation offers a practical bridge between analytics and managerial action.

# Final Managerial Recommendations

## Overview

The analysis reveals that Walmart’s sales performance and volatility are not driven by a single universal factor such as holidays or macroeconomic conditions. Instead, store-level heterogeneity plays a dominant role. Based on the k-means segmentation results, three distinct store segments emerge, each requiring a differentiated managerial approach. The recommendations below translate the analytical findings into concrete, actionable strategies.

---

## Segment 1: Stable Low-Impact Stores

**Profile**
- Low average weekly sales  
- Very low relative volatility  
- Slightly negative or negligible holiday lift  

**Managerial Implications**

These stores exhibit predictable demand patterns but contribute limited incremental revenue, particularly during holiday periods. Aggressive promotional activity in these locations is unlikely to yield meaningful returns.

**Recommended Actions**
- Avoid heavy holiday-specific promotions  
- Focus on cost efficiency and operational stability  
- Use these stores as baseline demand benchmarks for



# Conclusion, Limitations, and Future Research

## Conclusion

This study addressed a focused set of research questions aimed at understanding Walmart’s sales dynamics from both an analytical and managerial perspective.

Regarding **RQ1**, the analysis showed that both sales and sales volatility are highly concentrated across stores, but not perfectly aligned. A relatively small number of stores account for a disproportionate share of total sales, while a partially different subset drives overall volatility. This finding confirms that aggregate performance metrics mask important store-level differences.

With respect to **RQ2**, the holiday analysis demonstrated that holiday periods are associated with higher average sales and increased dispersion. However, the effect of holidays is highly heterogeneous: only a subset of holiday weeks generates extreme sales spikes, while others resemble regular periods. This confirms that a simple binary holiday indicator provides an incomplete representation of holiday-driven demand.

In answering **RQ3**, regression results revealed that while holidays appear significant in isolation, their effect disappears once time structure and macroeconomic conditions are controlled for. Seasonality and broader economic indicators—particularly unemployment and inflation—play a substantially more important role in explaining weekly sales variation. These findings suggest that holiday timing alone is insufficient for understanding or predicting sales performance.

Finally, **RQ4** demonstrated that store segmentation offers a practical solution to the limitations of aggregate modeling. Using k-means clustering, stores were grouped into three distinct segments based on sales level, volatility, and holiday sensitivity. These segments exhibit clearly differentiated behavioral patterns, enabling targeted managerial strategies rather than uniform, chain-wide policies.

Overall, the results highlight the importance of combining descriptive analysis, explanatory modeling, and segmentation to translate complex sales data into actionable insights. By aligning analytical methods with well-defined research questions, the study provides a structured framework for data-driven decision-making in retail marketing analytics.


---

## Limitations

Despite its insights, this analysis is subject to several limitations. First, the dataset does not distinguish between different types of holidays (e.g., major shopping events versus routine public holidays). As a result, holiday effects are captured in an aggregated manner, which may mask important differences in demand intensity across specific events. Second, the regression models rely on aggregate weekly sales and do not incorporate product-level or customer-level information, limiting the ability to analyze category-specific or behavioral drivers of demand. 

Additionally, while macroeconomic variables improve explanatory power, the relatively low R-squared values suggest that unobserved factors—such as local competition, store-specific promotions, or regional demographics—may play a significant role. Finally, the clustering results depend on the selected variables and number of clusters; alternative specifications could yield slightly different segment structures.

---

## Future Research Directions

Future research could extend this analysis in several ways. First, incorporating more granular holiday classifications (e.g., separating major promotional events from minor holidays) would allow for a more precise estimation of event-specific effects. Second, integrating store-level promotional data, pricing information, or product-category sales could improve explanatory power and enable deeper behavioral insights. 

Moreover, advanced modeling techniques such as mixed-effects models or panel data approaches could explicitly account for store-specific fixed effects and dynamic responses over time. Finally, linking sales performance to customer or regional demographic data would support more personalized and location-specific marketing strategies, further enhancing the managerial relevance of the analysis.


