Project 2 Mehreen Ali Gillani

Walmart Sales Data 2010-2012

This is a Walmart sales data available on kaggle. We will study the sales data of one of the largest retailers in the world. Let’s figure out what factors influence its revenue. Can factors such as air temperature and fuel cost influence the success of a huge company along with the purchasing power index and Unemployment rate?

The data contains the following columns:
Store: Store number
Date: Sales week start date
Weekly_Sales: Sales
Holiday_Flag: Mark on the presence or absence of a holiday
Temperature: Air temperature in the region
Fuel_Price: Fuel cost in the region
CPI: Consumer price index
Unemployment: Unemployment rate

Step 1. Read csv file from Kaggle link:

library(tidyverse) #include dplyr, tidyr, ggplot2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

url = 'https://storage.googleapis.com/kagglesdsdata/datasets/4438189/8620416/Walmart_Sales.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20251003%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251003T151856Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=180c63c0d5ae7c535cb4862d6966d553b1623464927b895695ea1a76ec7ac151223542a134db86ba769f600ea64aea9a49886e85f225b13396f59418f442f95ad6bb614e47523cb595cdcf673e5757f7d4766ecc240878999b8679032898d752dae210d7a19cd8bf7998472ddcebeecb3d2fbaa14b38339700e4867fc097ea007354558357f402cd37e8020e0dbdd3a7aba364d23e52d3c4271819cd17e1b3fa7fa100394572d26e96f7e60b182ac6588fa35fce3613d289a6c91318b0f43bb0461b0f0de7606f4652e7843af08e1b0133481ff1a620e5a998f50e903b535f7fa166e11cd301188007f9267ed3f7830864e37b57ea25adbff171e70c9c15b7a5'

walmart_sales <- read.csv(url) #read csv from Kaggle url

Step 2: Initial Data Exploration

cat("Dataset Structure:\n")

## Dataset Structure:

str(walmart_sales)

## 'data.frame':    6435 obs. of  8 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr  "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
##  $ Weekly_Sales: num  1643691 1641957 1611968 1409728 1554807 ...
##  $ Holiday_Flag: int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Temperature : num  42.3 38.5 39.9 46.6 46.5 ...
##  $ Fuel_Price  : num  2.57 2.55 2.51 2.56 2.62 ...
##  $ CPI         : num  211 211 211 211 211 ...
##  $ Unemployment: num  8.11 8.11 8.11 8.11 8.11 ...

cat("\nFirst few rows:\n")

## 
## First few rows:

head(walmart_sales)

##   Store       Date Weekly_Sales Holiday_Flag Temperature Fuel_Price      CPI
## 1     1 05-02-2010      1643691            0       42.31      2.572 211.0964
## 2     1 12-02-2010      1641957            1       38.51      2.548 211.2422
## 3     1 19-02-2010      1611968            0       39.93      2.514 211.2891
## 4     1 26-02-2010      1409728            0       46.63      2.561 211.3196
## 5     1 05-03-2010      1554807            0       46.50      2.625 211.3501
## 6     1 12-03-2010      1439542            0       57.79      2.667 211.3806
##   Unemployment
## 1        8.106
## 2        8.106
## 3        8.106
## 4        8.106
## 5        8.106
## 6        8.106

Step 3: Show data summary

cat("\nSummary statistics:\n")

## 
## Summary statistics:

summary(walmart_sales)

##      Store        Date            Weekly_Sales      Holiday_Flag    
##  Min.   : 1   Length:6435        Min.   : 209986   Min.   :0.00000  
##  1st Qu.:12   Class :character   1st Qu.: 553350   1st Qu.:0.00000  
##  Median :23   Mode  :character   Median : 960746   Median :0.00000  
##  Mean   :23                      Mean   :1046965   Mean   :0.06993  
##  3rd Qu.:34                      3rd Qu.:1420159   3rd Qu.:0.00000  
##  Max.   :45                      Max.   :3818686   Max.   :1.00000  
##   Temperature       Fuel_Price         CPI         Unemployment   
##  Min.   : -2.06   Min.   :2.472   Min.   :126.1   Min.   : 3.879  
##  1st Qu.: 47.46   1st Qu.:2.933   1st Qu.:131.7   1st Qu.: 6.891  
##  Median : 62.67   Median :3.445   Median :182.6   Median : 7.874  
##  Mean   : 60.66   Mean   :3.359   Mean   :171.6   Mean   : 7.999  
##  3rd Qu.: 74.94   3rd Qu.:3.735   3rd Qu.:212.7   3rd Qu.: 8.622  
##  Max.   :100.14   Max.   :4.468   Max.   :227.2   Max.   :14.313

Dataset Characteristics:

Store Coverage: 45 distinct retail locations
Temporal Features:
Date field (currently character format - requires conversion to proper date type)
Holiday indicator (binary: 0 = regular day, 1 = holiday)
Economic Metrics:
Fuel prices ranging from $2.50 to $4.50
Weekly sales data Wide spread from $209K to $3.82M
Consumer prices vary widely (126-227 CPI range) with a central tendency around 183,
indicating generally elevated but inconsistent economic conditions across stores.
Environmental Factor:
Temperature range: -2°F to 100°F with mean of 60.7°F
The unemployment data shows considerable disparity (3.9%-14.3%), with a central tendency near 8%,
reflecting diverse economic health across different store markets

Step 4: Data Cleaning and Type Conversion

# Clean and transform the data
walmart_clean <- walmart_sales %>%
  # Convert date to proper format and extract useful time features
  mutate(
    Date = as.Date(Date, format = "%d-%m-%Y"),
    year = year(Date),
    month = month(Date, label = TRUE),
    week = week(Date),
    day_of_week = wday(Date, label = TRUE),
    quarter = quarter(Date),
    # Convert categorical variables to factors
    Holiday_Flag = factor(Holiday_Flag, levels = c(0, 1), labels = c("non_holiday", "holiday")),
    # Create sales categories
    weekly_sales_category = case_when(
      Weekly_Sales < 1000000 ~ "Low",
      Weekly_Sales >= 1000000 & Weekly_Sales < 2000000 ~ "Medium",
      Weekly_Sales >= 2000000 ~ "High"
    ),
    # Create temperature segments
    temp_segment = cut(Temperature, 
                      breaks = c(-Inf, 30, 60, 90, Inf),
                      labels = c("Very Cold", "Cold", "Moderate", "Hot"))
  ) %>%
  # Remove any potential duplicates
  distinct() %>%
  # Handle missing values if any
  drop_na()

 # dplyr method to convert all column names to lower case letters
  walmart_clean <- walmart_clean %>%
  rename_all(tolower)

cat("After cleaning - Dataset structure:\n")

## After cleaning - Dataset structure:

glimpse(walmart_clean)

## Rows: 6,435
## Columns: 15
## $ store                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ date                  <date> 2010-02-05, 2010-02-12, 2010-02-19, 2010-02-26,…
## $ weekly_sales          <dbl> 1643691, 1641957, 1611968, 1409728, 1554807, 143…
## $ holiday_flag          <fct> non_holiday, holiday, non_holiday, non_holiday, …
## $ temperature           <dbl> 42.31, 38.51, 39.93, 46.63, 46.50, 57.79, 54.58,…
## $ fuel_price            <dbl> 2.572, 2.548, 2.514, 2.561, 2.625, 2.667, 2.720,…
## $ cpi                   <dbl> 211.0964, 211.2422, 211.2891, 211.3196, 211.3501…
## $ unemployment          <dbl> 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106,…
## $ year                  <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
## $ month                 <ord> Feb, Feb, Feb, Feb, Mar, Mar, Mar, Mar, Apr, Apr…
## $ week                  <dbl> 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ day_of_week           <ord> Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri…
## $ quarter               <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ weekly_sales_category <chr> "Medium", "Medium", "Medium", "Medium", "Medium"…
## $ temp_segment          <fct> Cold, Cold, Cold, Cold, Cold, Cold, Cold, Cold, …

unique(walmart_clean$year)

## [1] 2010 2011 2012

dim(walmart_clean)

## [1] 6435   15

Data dimension: 6435 rows with 15 columns Data is from 2010, 2011 and 2012

Step 5: Data Reshaping with tidyr

# Create a wide format for store comparisons
store_performance_wide <- walmart_clean %>%
  group_by(store, year, quarter) %>%
  summarise(
    total_sales = sum(weekly_sales),
    avg_sales = mean(weekly_sales),
    max_sales = max(weekly_sales),
    min_sales = min(weekly_sales),
    .groups = 'drop'
  ) %>%
  pivot_wider(
    names_from = quarter,
    values_from = c(total_sales, avg_sales, max_sales, min_sales),
    names_sep = "_Q"
  )

cat("Wide format - Store performance by quarter:\n")

## Wide format - Store performance by quarter:

glimpse(store_performance_wide)

## Rows: 135
## Columns: 18
## $ store          <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7…
## $ year           <dbl> 2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012, 2…
## $ total_sales_Q1 <dbl> 12178638, 18187314, 20723763, 15942113, 22490219, 24528…
## $ total_sales_Q2 <dbl> 19436822, 20012527, 20978760, 25367304, 23905955, 25083…
## $ total_sales_Q3 <dbl> 19150230, 20752182, 20253948, 24294571, 24852063, 24303…
## $ total_sales_Q4 <dbl> 22513142, 21969895, 6245587, 29673876, 27359644, 758151…
## $ avg_sales_Q1   <dbl> 1522329.8, 1515609.5, 1594135.6, 1992764.2, 1874185.0, …
## $ avg_sales_Q2   <dbl> 1495140.2, 1539425.2, 1613750.8, 1951331.1, 1838919.6, …
## $ avg_sales_Q3   <dbl> 1473094.6, 1482298.7, 1557996.0, 1868813.2, 1775147.3, …
## $ avg_sales_Q4   <dbl> 1608081.6, 1689991.9, 1561396.8, 2119562.5, 2104588.0, …
## $ max_sales_Q1   <dbl> 1643690.9, 1686842.8, 1819870.0, 2137809.5, 2168041.6, …
## $ max_sales_Q2   <dbl> 1615524.7, 1635078.4, 1899676.9, 2102539.9, 1953772.0, …
## $ max_sales_Q3   <dbl> 1605491.8, 1624383.8, 1769854.2, 2003940.6, 1876704.3, …
## $ max_sales_Q4   <dbl> 2387950.2, 2270189.0, 1670786.0, 3436007.7, 3224369.8, …
## $ min_sales_Q1   <dbl> 1404429.9, 1316899.3, 1319325.6, 1750197.8, 1695371.7, …
## $ min_sales_Q2   <dbl> 1391256.1, 1428218.3, 1468928.4, 1802450.3, 1688281.9, …
## $ min_sales_Q3   <dbl> 1351791.0, 1352219.8, 1437059.3, 1724557.2, 1650394.4, …
## $ min_sales_Q4   <dbl> 1345454.0, 1445249.1, 1493659.7, 1737947.6, 1743882.2, …

# Create quarterly_sales_long first
quarterly_sales_long <- store_performance_wide %>%
  select(store, year, contains("total_sales")) %>%
  pivot_longer(
    cols = contains("total_sales"),
    names_to = "quarter",
    values_to = "sales",
    names_prefix = "total_sales_Q"
  ) %>%
  mutate(quarter = paste0("Q", quarter))

# Better way to find top stores (using total sales across all quarters)
top_stores <- quarterly_sales_long %>%
  group_by(store) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE)) %>%
  slice_max(total_sales, n = 8) %>%
  pull(store)

# Create the plot
quarterly_trends <- quarterly_sales_long %>%
  filter(store %in% top_stores)

ggplot(quarterly_trends, aes(x = quarter, y = sales, group = store, color = factor(store))) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  facet_wrap(~ year) +
  labs(
    title = "Quarterly Sales Trends - Top 8 Performing Stores",
    x = "Quarter",
    y = "Quarterly Sales",
    color = "Store ID"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::dollar)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Findings from Quarterly Sales Trends - Top 8 Performing Stores:

Q4 Performance: 2010 and 2011 showed peak sales in Q4, while 2012 Q4 dropped to the lowest level.

2011 Trend: Consistent quarter-over-quarter sales growth throughout the year.

2010 Pattern: Q1 started low, increased in Q2, and peaked in Q4.

2012 Anomaly: Moderate sales in Q1-Q3 followed by a sharp Q4 decline.

To assess data completeness, we need to verify the number of weeks recorded per store each year. The observed sales decline in 2012 Q4 may stem from incomplete data coverage rather than actual performance issues.

weeks_per_store_year <- walmart_clean %>%
  group_by(store, year) %>%
  summarise(
    total_weeks = n(),
    unique_weeks = n_distinct(week),
    .groups = 'drop'
  )

# View the results
view(weeks_per_store_year)
print(weeks_per_store_year, n = 20)

## # A tibble: 135 × 4
##    store  year total_weeks unique_weeks
##    <int> <dbl>       <int>        <int>
##  1     1  2010          48           48
##  2     1  2011          52           52
##  3     1  2012          43           43
##  4     2  2010          48           48
##  5     2  2011          52           52
##  6     2  2012          43           43
##  7     3  2010          48           48
##  8     3  2011          52           52
##  9     3  2012          43           43
## 10     4  2010          48           48
## 11     4  2011          52           52
## 12     4  2012          43           43
## 13     5  2010          48           48
## 14     5  2011          52           52
## 15     5  2012          43           43
## 16     6  2010          48           48
## 17     6  2011          52           52
## 18     6  2012          43           43
## 19     7  2010          48           48
## 20     7  2011          52           52
## # ℹ 115 more rows

Data coverage analysis reveals the sales decline in 2012 Q4 is attributable to incomplete data rather than performance issues. While 2011 has complete 52-week records, 2012 contains only 43 weeks of data, creating artificial declines in the quarterly analysis.

Step 6: Advanced Data Transformations

6.1 Transformation 1: Store-level performance metrics

# Create multiple transformed views of the data


store_metrics <- walmart_clean %>%
  group_by(store, year) %>%
  summarise(
    #Total annual sales for each store-year combination
    total_revenue = sum(weekly_sales),
    #Average weekly sales performance
    #Helps compare stores of different sizes on a normalized basis
    avg_weekly_sales = mean(weekly_sales),
    #Standard deviation of weekly sales
    #Measures how consistent or unpredictable sales are throughout the year
    sales_volatility = sd(weekly_sales),
    #Compares average sales during holiday weeks vs non-holiday weeks
    holiday_boost = mean(weekly_sales[holiday_flag == "holiday"]) / 
                    mean(weekly_sales[holiday_flag == "non_holiday"]),
    best_month = month[which.max(weekly_sales)][1],
    worst_month = month[which.min(weekly_sales)][1],
    observation_count = n(),
    .groups = 'drop'
  ) %>%
  mutate(
    #Divides all stores into 4 equal groups (quartiles) based on avg_weekly_sales
    #Each store gets a number 1-4 representing which quartile they belong to
    performance_tier = ntile(avg_weekly_sales, 4),
    performance_label = case_when(
      performance_tier == 1 ~ "Low",
      performance_tier == 2 ~ "Medium",
      performance_tier == 3 ~ "High",
      performance_tier == 4 ~ "Elite"
    )
  )
view(store_metrics)

Holiday performance ratio

Values > 1 = stores perform better during holidays Values < 1 = stores perform worse during holidays High volatility = unpredictable sales patterns

6.2 Transformation 2: Time series analysis by store and week

# Load the zoo package for rolling mean calculations
library('zoo')

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

# Create time series features for sales analysis
time_series_data <- walmart_clean %>%
  # Sort data chronologically for each store to ensure proper time sequence
  arrange(store, year, month, week) %>%
  # Group by store to calculate metrics within each store separately
  group_by(store) %>%
  # Create new time-based metrics for each store
  mutate(
    # Get previous week's sales for comparison (1-week lag)
    sales_lag1 = lag(weekly_sales, 1),
    # Calculate week-over-week sales growth percentage
    sales_growth = (weekly_sales / sales_lag1 - 1) * 100,
    # Compute 4-week moving average to smooth out weekly fluctuations
    rolling_avg_4wk = zoo::rollmean(weekly_sales, 4, fill = NA, align = 'right')
  ) %>%
  # Remove grouping to return to normal dataframe structure
  ungroup()

# Select only the key columns for analysis and reporting
selected_data <- time_series_data %>%
  select(store, year, month, week, weekly_sales, sales_lag1, sales_growth, rolling_avg_4wk)

# Open interactive data viewer to explore the transformed dataset
view(selected_data)

# Display first few rows in console for quick inspection
head(selected_data)

## # A tibble: 6 × 8
##   store  year month  week weekly_sales sales_lag1 sales_growth rolling_avg_4wk
##   <int> <dbl> <ord> <dbl>        <dbl>      <dbl>        <dbl>           <dbl>
## 1     1  2010 Feb       6     1643691.        NA        NA                 NA 
## 2     1  2010 Feb       7     1641957.   1643691.       -0.105             NA 
## 3     1  2010 Feb       8     1611968.   1641957.       -1.83              NA 
## 4     1  2010 Feb       9     1409728.   1611968.      -12.5          1576836.
## 5     1  2010 Mar      10     1554807.   1409728.       10.3          1554615.
## 6     1  2010 Mar      11     1439542.   1554807.       -7.41         1504011.

6.3 Transformation 3 time series analysis for store and MONTHLY rolling mean

# monthly analysis (aggregate weekly data into monthly totals first)
monthly_analysis <- walmart_clean %>%
  # Group by store and month to create monthly aggregates
  group_by(store, year, month) %>%
  # Calculate monthly totals from weekly data
  summarise(
    monthly_sales = sum(weekly_sales),
    .groups = 'drop'
  ) %>%
  # Sort chronologically
  arrange(store, year, month) %>%
  # Group by store for time series calculations
  group_by(store) %>%
  mutate(
    # Get previous month's total sales
    monthly_sales_lag1 = lag(monthly_sales, 1),
    # Calculate month-over-month growth
    monthly_growth = (monthly_sales / monthly_sales_lag1 - 1) * 100,
    # Compute 3-month rolling average of monthly totals
    rolling_avg_3month = zoo::rollmean(monthly_sales, 3, fill = NA, align = 'right')
  ) %>%
  ungroup()

# View monthly analysis
view(monthly_analysis)
head(monthly_analysis)

## # A tibble: 6 × 7
##   store  year month monthly_sales monthly_sales_lag1 monthly_growth
##   <int> <dbl> <ord>         <dbl>              <dbl>          <dbl>
## 1     1  2010 Feb        6307344.                NA           NA   
## 2     1  2010 Mar        5871294.           6307344.          -6.91
## 3     1  2010 Apr        7422802.           5871294.          26.4 
## 4     1  2010 May        5929939.           7422802.         -20.1 
## 5     1  2010 Jun        6084081.           5929939.           2.60
## 6     1  2010 Jul        7244483.           6084081.          19.1 
## # ℹ 1 more variable: rolling_avg_3month <dbl>

6.4 Transformation 4: Holiday impact analysis

holiday_analysis <- walmart_clean %>%
  group_by(store, holiday_flag) %>%
  summarise(
    avg_sales = mean(weekly_sales),
    median_sales = median(weekly_sales),
    sales_count = n(),
    .groups = 'drop'
  ) %>%
  group_by(store) %>%
  arrange(store, holiday_flag) %>%  # Sort by store and holiday_flag (0 then 1)
  mutate(
    holiday_premium = (avg_sales / lag(avg_sales) - 1) * 100  
    # Calculates percentage difference: 
    # For holiday_flag=1 row: (holiday_sales / previous_non_holiday_sales - 1) * 100
    # Shows how much higher/lower holiday sales are compared to non-holiday sales
  ) %>%
  filter(!is.na(holiday_premium))  # Keep only rows where calculation worked (holiday_flag=1 rows)
view(holiday_analysis)

Final filter keeps only holiday rows with the calculated premium Example: If non-holiday sales = $100 and holiday sales = $125:

(125 / 100 - 1) * 100 = 25% holiday premium

6.5 Weather and economic impact

# Analyze external factors
external_factors_analysis <- walmart_clean %>%
  group_by(store, temp_segment) %>%
  summarise(
    avg_sales = mean(weekly_sales),
    avg_temperature = mean(temperature),
    avg_fuel_price = mean(fuel_price),
    avg_cpi = mean(cpi),
    avg_unemployment = mean(unemployment),
    observation_count = n(),
    .groups = 'drop'
  ) %>%
  arrange(store, avg_temperature)
external_factors_analysis

## # A tibble: 131 × 8
##    store temp_segment avg_sales avg_temperature avg_fuel_price avg_cpi
##    <int> <fct>            <dbl>           <dbl>          <dbl>   <dbl>
##  1     1 Cold          1600360.            49.9           3.04    215.
##  2     1 Moderate      1535739.            75.6           3.29    216.
##  3     1 Hot           1574765.            91.2           3.66    216.
##  4     2 Cold          2043379.            49.0           3.05    215.
##  5     2 Moderate      1875919.            76.0           3.28    216.
##  6     2 Hot           1827482.            91.1           3.59    217.
##  7     3 Cold           431030.            52.3           3.00    218.
##  8     3 Moderate       394864.            76.7           3.28    220.
##  9     4 Very Cold     2188307.            28.8           2.57    126.
## 10     4 Cold          2160063.            45.9           3.09    128.
## # ℹ 121 more rows
## # ℹ 2 more variables: avg_unemployment <dbl>, observation_count <int>

Step 7: Correlation analysis between variables

# Correlation analysis between variables
correlation_analysis <- walmart_clean %>%
  group_by(store) %>%
  summarise(
    cor_temp_sales = cor(temperature, weekly_sales),
    cor_fuel_sales = cor(fuel_price, weekly_sales),
    cor_cpi_sales = cor(cpi, weekly_sales),
    cor_unemp_sales = cor(unemployment, weekly_sales),
    .groups = 'drop'
  )
view(correlation_analysis)

# Convert to long format for heatmap
correlation_long <- correlation_analysis %>%
  pivot_longer(
    cols = -store,
    names_to = "variable",
    values_to = "correlation"
  ) %>%
  mutate(
    variable = case_when(
      variable == "cor_temp_sales" ~ "Temperature",
      variable == "cor_fuel_sales" ~ "Fuel Price",
      variable == "cor_cpi_sales" ~ "CPI",
      variable == "cor_unemp_sales" ~ "Unemployment",
      TRUE ~ variable
    )
  )

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(dplyr)

# First, ensure variable names are properly formatted
significant_correlations <- correlation_long %>%
  filter(abs(correlation) > 0.3) %>%
  mutate(
    variable_clean = case_when(
      variable == "cor_temp_sales" ~ "Temperature",
      variable == "cor_fuel_sales" ~ "Fuel Price",
      variable == "cor_cpi_sales" ~ "CPI",
      variable == "cor_unemp_sales" ~ "Unemployment",
      TRUE ~ variable
    ),
    correlation_label = round(correlation, 3),
    significance_level = case_when(
      abs(correlation) > 0.7 ~ "Very Strong",
      abs(correlation) > 0.5 ~ "Strong", 
      abs(correlation) > 0.3 ~ "Moderate",
      TRUE ~ "Weak"
    )
  )

# Create interactive heatmap with proper x-axis
plot_ly(
  data = significant_correlations,
  x = ~variable_clean,  # Use cleaned variable names
  y = ~as.factor(store),
  z = ~correlation,
  type = "heatmap",
  colors = colorRamp(c("darkred", "white", "darkblue")),
  hoverinfo = "text",
  text = ~paste(
    "Store:", store,
    "<br>Variable:", variable_clean,
    "<br>Correlation:", correlation_label,
    "<br>Significance:", significance_level
  ),
  colorbar = list(
    title = "Correlation",
    tickvals = c(-1, -0.5, 0, 0.5, 1),
    ticktext = c("-1.0", "-0.5", "0", "0.5", "1.0")
  )
) %>%
  layout(
    title = list(
      text = "<b>Significant Correlations Only (|r| > 0.3)</b>",
      x = 0.5,
      font = list(size = 16)
    ),
    xaxis = list(
      title = "Economic Variables",
      tickangle = -45,
      categoryorder = "array",
      categoryarray = c("Temperature", "Fuel Price", "CPI", "Unemployment")  # Explicit order
    ),
    yaxis = list(
      title = "Store ID"
    ),
    margin = list(l = 80, r = 50, b = 100, t = 80),
    width = 800,
    height = 600
  )

## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

Based on the correlation heatmap analysis, here are the key findings:

Correlation Analysis Findings

1. Temperature Impact:

Temperature shows a consistent negative correlation with weekly sales across most stores As temperatures increase, sales tend to decrease, or vice versa suggesting potential seasonal patterns

2. CPI (Consumer Price Index) Relationship:

CPI demonstrates predominantly strong positive correlations with sales across the store network Store 36 stands out as a significant exception with a highly negative correlation (-0.92) This indicates most stores benefit from economic expansion, while Store 36 shows inverse behavior

3. Fuel Price Effects:

Fuel prices exhibit mixed correlations with sales performance Positive correlations observed in several stores, suggesting some locations benefit from or are resilient to fuel price changes Negative correlations identified in key locations:
Store 36 (most affected), Store 35, Store 31, Store 14 This indicates varied customer sensitivity to transportation costs across different store locations

4. Unemployment Rate Influence:

Unemployment rates show primarily negative correlations with sales Higher unemployment typically associates with reduced consumer spending Notable exceptions with strong positive correlations: Store 36, Store 35 These stores appear to perform better in higher unemployment environments, suggesting unique local market dynamics

Key Store Insights:
Store 36 demonstrates atypical behavior across multiple variables, warranting further investigation Stores show heterogeneous responses to economic factors, highlighting the importance of localized strategies The mixed correlations suggest customer demographics and local economic conditions significantly influence each store’s performance drivers These findings emphasize the need for store-specific strategies rather than one-size-fits-all approaches to sales optimization.

Step 8: Annual performance metrics for each store

# Calculate annual performance metrics for each store
store_year_metrics <- walmart_clean %>%
  # Group by store and year to calculate annual statistics
  group_by(store, year) %>%
  # Compute key performance indicators for each store-year combination
  summarise(
    # Total annual revenue across all weeks
    total_annual_sales = sum(weekly_sales),
    # Average weekly sales throughout the year
    mean_weekly_sales = mean(weekly_sales),
    # Median weekly sales (robust measure of typical performance)
    median_weekly_sales = median(weekly_sales),
    # Standard deviation of weekly sales (measures volatility/consistency)
    sales_volatility = sd(weekly_sales),
    # Number of weeks with data in the year
    weeks_recorded = n(),
    # Average temperature throughout the year
    mean_temperature = mean(temperature),
    # Average fuel price throughout the year
    mean_fuel_price = mean(fuel_price),
    # Average consumer price index throughout the year
    mean_cpi = mean(cpi),
    # Average unemployment rate throughout the year
    mean_unemployment = mean(unemployment),
    .groups = 'drop'
  ) %>%
  # Sort by store and year for better readability
  arrange(store, year)

# View the comprehensive annual metrics in interactive data viewer
view(store_year_metrics)

# Display first few rows in console for quick inspection
head(store_year_metrics)

## # A tibble: 6 × 11
##   store  year total_annual_sales mean_weekly_sales median_weekly_sales
##   <int> <dbl>              <dbl>             <dbl>               <dbl>
## 1     1  2010          73278832           1526642.            1494365.
## 2     1  2011          80921919.          1556191.            1537167.
## 3     1  2012          68202058.          1586094.            1582083.
## 4     2  2010          95277864.          1984956.            1927937.
## 5     2  2011          98607881.          1896305.            1838128.
## 6     2  2012          81496695.          1895272.            1898777.
## # ℹ 6 more variables: sales_volatility <dbl>, weeks_recorded <int>,
## #   mean_temperature <dbl>, mean_fuel_price <dbl>, mean_cpi <dbl>,
## #   mean_unemployment <dbl>

Step 9 Visualization:

9.1 Line plot showing sales trends

library(plotly)

# Line plot showing sales trends
sales_trend <- ggplot(store_year_metrics, 
                     aes(x = year, y = total_annual_sales, 
                         group = store, color = as.factor(store))) +
  geom_line(alpha = 0.7) +
  geom_point(size = 1) +
  labs(title = "Annual Sales Trends by Store (2010-2012)",
       x = "Year", y = "Total Annual Sales ($)",
       color = "Store ID") +
  theme_minimal() +
  scale_y_continuous(labels = scales::dollar)

# Interactive version
ggplotly(sales_trend)

Store Performance Trends Analysis

1. Declining Performance Stores:

Store 14 demonstrates a concerning downward trajectory with consistent sales degradation from 2010 through 2012 Store 35 shows a similar declining pattern, indicating potential operational challenges or market share loss in these locations

2. Consistently Underperforming Store:

Store 33 maintains the lowest annual sales performance across all three years (2010-2012) This store consistently ranks at the bottom of the performance spectrum, suggesting systemic issues requiring immediate intervention

9.2 Box plot showing sales distribution

sales_box <- ggplot(store_year_metrics, 
                   aes(x = as.factor(year), y = total_annual_sales)) +
  geom_boxplot(fill = "lightblue", alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
  labs(title = "Annual Sales Distribution by Year",
       x = "Year", y = "Total Annual Sales ($)") +
  theme_minimal() +
  scale_y_continuous(labels = scales::dollar)

sales_box

Based on the Sales Distribution Box Plot Analysis:

2011 achieved peak annual sales performance across the store network
2010 maintained strong second-place results with consistent sales distribution
2012 experienced significant sales degradation with markedly lower total annual sales
The analysis reveals a concerning downward trend from 2011’s peak performance to 2012’s substantial decline, indicating potential market challenges

9.3 Heatmap of store performance across years

performance_heatmap <- ggplot(store_year_metrics, 
                             aes(x = as.factor(year), y = as.factor(store), 
                                 fill = total_annual_sales)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_viridis_c(
    labels = scales::dollar, 
    name = "Annual Sales",
    option = "plasma"  # Better color contrast
  ) +
  labs(
    title = "Walmart Store Performance Heatmap (2010-2012)",
    subtitle = "Annual Sales Distribution Across 45 Stores",
    x = "Year", 
    y = "Store ID"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, size = 10),
    axis.text.y = element_text(size = 8),  # Smaller font for store labels
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    panel.grid = element_blank(),
    legend.position = "right",
    plot.margin = margin(0.0000005, .0000005, .000005, 0.000005, "cm")  # Increased margins
  ) +
  # Add value labels on tiles for better readability
  geom_text(aes(label = scales::dollar(total_annual_sales, scale = 1e-6, suffix = "M")), 
            size = 2, color = "white", fontface = "bold")

# Display with increased dimensions
performance_heatmap

# Save with specific dimensions
ggsave("store_performance_heatmap.png", performance_heatmap, 
       width = 30, height = 29, units = "in", dpi = 500)

#### 9.4 Average vs Standard Deviation for Sales

volatility_plot <- ggplot(store_year_metrics, 
                         aes(x = mean_weekly_sales, y = sales_volatility,
                             size = total_annual_sales, color = as.factor(store))) +
  geom_point(alpha = 0.7) +
  labs(title = "Sales Consistency Analysis: Average vs Volatility",
       x = "Mean Weekly Sales ($)", 
       y = "Sales Volatility (Standard Deviation)",
       size = "Total Annual Sales", color = "Store ID") +
  theme_minimal() +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  guides(color = "none")  # Remove store color legend for clarity

# Interactive version
ggplotly(volatility_plot)

Step 10: Calculate year-over-year growth for each store

#Calculate year-over-year growth for each store
store_growth_analysis <- store_year_metrics %>%
  # Group by store to calculate growth within each store
  group_by(store) %>%
  # Sort by year to ensure proper sequence
  arrange(year) %>%
  mutate(
    # Calculate year-over-year sales growth percentage
    yoy_sales_growth = (total_annual_sales / lag(total_annual_sales) - 1) * 100,
    # Calculate year-over-year weekly sales growth
    yoy_weekly_growth = (mean_weekly_sales / lag(mean_weekly_sales) - 1) * 100
  ) %>%
  ungroup()
  store_growth_analysis_data <- store_growth_analysis %>%
    select(store, year, yoy_sales_growth,yoy_weekly_growth)%>%
    filter(year %in% c(2011, 2012))

# View the growth analysis
view(store_growth_analysis_data)

#Year-over-Year Sales Growth by Store
growth_plot <- ggplot(store_growth_analysis_data, 
                     aes(x = as.factor(store), y = yoy_sales_growth, fill = yoy_sales_growth > 0)) +
  geom_col() +
  labs(title = "Year-over-Year Sales Growth by Store",
       x = "Store ID", y = "Year-over-Year Growth (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values = c("TRUE" = "darkgreen", "FALSE" = "darkgrey"),
                    name = "Positive Growth")

growth_plot

### Year-over-Year Sales Growth Analysis

1. Significant Sales Decline Stores:

Stores 35 & 36 experienced a drastic sales deterioration with consecutive year-over-year declines from 2010 through 2012
Stores 13 & 14 showed a sharp sales reduction in 2012, contributing substantially to the overall annual performance drop
2. Positive Growth Performers:
Store 7 demonstrated notable sales improvement from 2010 to 2011, indicating successful growth strategies
Store 38 achieved consistent sales growth from 2010 to 2011, showing positive momentum
3. Overall Performance Context:

The pronounced declines in Stores 13, 14, 35, and 36 directly contributed to 2012’s lower aggregate sales observed in the box plot analysis

Summary

This comprehensive analysis of Walmart’s sales data from 2010-2012 across 45 stores reveals significant insights into retail performance drivers, seasonal patterns, and economic influences. The study demonstrates that sales performance is multifaceted, influenced by both internal operational factors and external economic conditions.

Key Findings

1. Sales Performance Patterns

Peak Performance: 2011 represented the strongest sales year across the store network Concerning Decline: 2012 showed significant sales degradation, with a 7.6% average decrease from 2011 levels Store Variability: Performance varied dramatically across locations, with top-performing stores generating 3-4x more revenue than bottom performers

2. Critical Underperforming Stores

Persistent Issues: Stores 33, 14, 35, and 36 demonstrated consistent performance challenges Store 33: Consistently ranked lowest in annual sales across all three years Stores 14 & 35: Showed progressive sales deterioration from 2010-2012 Store 36: Exhibited atypical correlation patterns with economic indicators

3. Economic Factor Correlations

Temperature: Generally negative correlation with sales (-0.2 to -0.4 range), suggesting seasonal shopping patterns CPI: Mixed impacts with predominantly positive correlations, except Store 36 showing strong negative correlation (-0.92) Fuel Prices: Varied effects across stores, indicating regional sensitivity differences Unemployment: Primarily negative correlations, though Stores 35 and 36 showed counter-intuitive positive relationships

4. Seasonal and Holiday Impacts

Holiday Premium: Average 15-25% sales increase during holiday weeks across most stores Temperature Segments: “Cold” weather periods (30-60°F) showed highest sales consistency Quarterly Patterns: Q4 consistently strongest performer due to holiday season

Limitations

Temporal Scope: Limited to 2010-2012 data, missing recent trends and COVID-era impacts Geographic Granularity: Store-level analysis without regional clustering External Factors: Limited incorporation of local competition, demographic changes, or marketing initiatives Data Completeness: Data coverage analysis reveals the sales decline in 2012 Q4 is attributable to incomplete data rather than performance issues. While 2011 has complete 52-week records, 2012 contains only 43 weeks of data, creating artificial declines in the quarterly analysis.

Future work:

Machine Learning: Implement predictive models for sales forecasting Advanced Visualization Development, Interactive Dashboard Geospatial Analysis: Map store performance against regional economic indicators

Conclusion

This analysis demonstrates that Walmart’s sales performance is influenced by a complex interplay of store-specific factors, economic conditions, and seasonal patterns. The findings highlight both the resilience of the retail operation and significant opportunities for targeted improvements. By leveraging these insights and pursuing the recommended future work, Walmart can enhance its strategic decision-making, optimize store performance, and maintain competitive advantage in the evolving retail landscape.

The most significant opportunity lies in moving from reactive analysis to predictive, prescriptive analytics that can anticipate performance challenges and recommend specific interventions before they impact revenue.