This analysis uses the Walmart sales dataset available on Kaggle. We will study the sales data of one of the largest retailers in the world and figure out which factors influence its revenue. Can air temperature and fuel cost, along with the consumer price index and the unemployment rate, influence the success of a company this large?
The data contains the following columns:
Store: store number
Date: sales week start date
Weekly_Sales: weekly sales
Holiday_Flag: flag for the presence or absence of a holiday week
Temperature: air temperature in the region
Fuel_Price: fuel cost in the region
CPI: consumer price index
Unemployment: unemployment rate
library(tidyverse) # includes dplyr, tidyr, ggplot2, and lubridate
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url = 'https://storage.googleapis.com/kagglesdsdata/datasets/4438189/8620416/Walmart_Sales.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20251003%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251003T151856Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=180c63c0d5ae7c535cb4862d6966d553b1623464927b895695ea1a76ec7ac151223542a134db86ba769f600ea64aea9a49886e85f225b13396f59418f442f95ad6bb614e47523cb595cdcf673e5757f7d4766ecc240878999b8679032898d752dae210d7a19cd8bf7998472ddcebeecb3d2fbaa14b38339700e4867fc097ea007354558357f402cd37e8020e0dbdd3a7aba364d23e52d3c4271819cd17e1b3fa7fa100394572d26e96f7e60b182ac6588fa35fce3613d289a6c91318b0f43bb0461b0f0de7606f4652e7843af08e1b0133481ff1a620e5a998f50e903b535f7fa166e11cd301188007f9267ed3f7830864e37b57ea25adbff171e70c9c15b7a5'
walmart_sales <- read.csv(url) #read csv from Kaggle url
cat("Dataset Structure:\n")
## Dataset Structure:
str(walmart_sales)
## 'data.frame': 6435 obs. of 8 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
## $ Weekly_Sales: num 1643691 1641957 1611968 1409728 1554807 ...
## $ Holiday_Flag: int 0 1 0 0 0 0 0 0 0 0 ...
## $ Temperature : num 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num 2.57 2.55 2.51 2.56 2.62 ...
## $ CPI : num 211 211 211 211 211 ...
## $ Unemployment: num 8.11 8.11 8.11 8.11 8.11 ...
cat("\nFirst few rows:\n")
##
## First few rows:
head(walmart_sales)
## Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI
## 1 1 05-02-2010 1643691 0 42.31 2.572 211.0964
## 2 1 12-02-2010 1641957 1 38.51 2.548 211.2422
## 3 1 19-02-2010 1611968 0 39.93 2.514 211.2891
## 4 1 26-02-2010 1409728 0 46.63 2.561 211.3196
## 5 1 05-03-2010 1554807 0 46.50 2.625 211.3501
## 6 1 12-03-2010 1439542 0 57.79 2.667 211.3806
## Unemployment
## 1 8.106
## 2 8.106
## 3 8.106
## 4 8.106
## 5 8.106
## 6 8.106
cat("\nSummary statistics:\n")
##
## Summary statistics:
summary(walmart_sales)
## Store Date Weekly_Sales Holiday_Flag
## Min. : 1 Length:6435 Min. : 209986 Min. :0.00000
## 1st Qu.:12 Class :character 1st Qu.: 553350 1st Qu.:0.00000
## Median :23 Mode :character Median : 960746 Median :0.00000
## Mean :23 Mean :1046965 Mean :0.06993
## 3rd Qu.:34 3rd Qu.:1420159 3rd Qu.:0.00000
## Max. :45 Max. :3818686 Max. :1.00000
## Temperature Fuel_Price CPI Unemployment
## Min. : -2.06 Min. :2.472 Min. :126.1 Min. : 3.879
## 1st Qu.: 47.46 1st Qu.:2.933 1st Qu.:131.7 1st Qu.: 6.891
## Median : 62.67 Median :3.445 Median :182.6 Median : 7.874
## Mean : 60.66 Mean :3.359 Mean :171.6 Mean : 7.999
## 3rd Qu.: 74.94 3rd Qu.:3.735 3rd Qu.:212.7 3rd Qu.: 8.622
## Max. :100.14 Max. :4.468 Max. :227.2 Max. :14.313
Store Coverage: 45 distinct retail locations.
Temporal Features: the Date field is currently in character format and requires conversion to a proper date type; the holiday indicator is binary (0 = regular week, 1 = holiday week).
Economic Metrics: fuel prices range from roughly $2.47 to $4.47; weekly sales show a wide spread from $209K to $3.82M; consumer prices vary widely (CPI range 126-227) with a central tendency around 183, indicating generally elevated but inconsistent economic conditions across stores.
Environmental Factor: temperature ranges from -2°F to 100°F with a mean of 60.7°F.
Unemployment: the data shows considerable disparity (3.9%-14.3%), with a central tendency near 8%, reflecting diverse economic health across different store markets.
# Clean and transform the data
walmart_clean <- walmart_sales %>%
# Convert date to proper format and extract useful time features
mutate(
Date = as.Date(Date, format = "%d-%m-%Y"),
year = year(Date),
month = month(Date, label = TRUE),
week = week(Date),
day_of_week = wday(Date, label = TRUE),
quarter = quarter(Date),
# Convert categorical variables to factors
Holiday_Flag = factor(Holiday_Flag, levels = c(0, 1), labels = c("non_holiday", "holiday")),
# Create sales categories
weekly_sales_category = case_when(
Weekly_Sales < 1000000 ~ "Low",
Weekly_Sales >= 1000000 & Weekly_Sales < 2000000 ~ "Medium",
Weekly_Sales >= 2000000 ~ "High"
),
# Create temperature segments
temp_segment = cut(Temperature,
breaks = c(-Inf, 30, 60, 90, Inf),
labels = c("Very Cold", "Cold", "Moderate", "Hot"))
) %>%
# Remove any potential duplicates
distinct() %>%
# Handle missing values if any
drop_na()
# dplyr method to convert all column names to lower case letters
walmart_clean <- walmart_clean %>%
rename_all(tolower)
cat("After cleaning - Dataset structure:\n")
## After cleaning - Dataset structure:
glimpse(walmart_clean)
## Rows: 6,435
## Columns: 15
## $ store <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ date <date> 2010-02-05, 2010-02-12, 2010-02-19, 2010-02-26,…
## $ weekly_sales <dbl> 1643691, 1641957, 1611968, 1409728, 1554807, 143…
## $ holiday_flag <fct> non_holiday, holiday, non_holiday, non_holiday, …
## $ temperature <dbl> 42.31, 38.51, 39.93, 46.63, 46.50, 57.79, 54.58,…
## $ fuel_price <dbl> 2.572, 2.548, 2.514, 2.561, 2.625, 2.667, 2.720,…
## $ cpi <dbl> 211.0964, 211.2422, 211.2891, 211.3196, 211.3501…
## $ unemployment <dbl> 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106,…
## $ year <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
## $ month <ord> Feb, Feb, Feb, Feb, Mar, Mar, Mar, Mar, Apr, Apr…
## $ week <dbl> 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ day_of_week <ord> Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri…
## $ quarter <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ weekly_sales_category <chr> "Medium", "Medium", "Medium", "Medium", "Medium"…
## $ temp_segment <fct> Cold, Cold, Cold, Cold, Cold, Cold, Cold, Cold, …
unique(walmart_clean$year)
## [1] 2010 2011 2012
dim(walmart_clean)
## [1] 6435 15
Data dimensions: 6,435 rows and 15 columns. The data covers 2010, 2011, and 2012.
# Create a wide format for store comparisons
store_performance_wide <- walmart_clean %>%
group_by(store, year, quarter) %>%
summarise(
total_sales = sum(weekly_sales),
avg_sales = mean(weekly_sales),
max_sales = max(weekly_sales),
min_sales = min(weekly_sales),
.groups = 'drop'
) %>%
pivot_wider(
names_from = quarter,
values_from = c(total_sales, avg_sales, max_sales, min_sales),
names_sep = "_Q"
)
cat("Wide format - Store performance by quarter:\n")
## Wide format - Store performance by quarter:
glimpse(store_performance_wide)
## Rows: 135
## Columns: 18
## $ store <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7…
## $ year <dbl> 2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012, 2…
## $ total_sales_Q1 <dbl> 12178638, 18187314, 20723763, 15942113, 22490219, 24528…
## $ total_sales_Q2 <dbl> 19436822, 20012527, 20978760, 25367304, 23905955, 25083…
## $ total_sales_Q3 <dbl> 19150230, 20752182, 20253948, 24294571, 24852063, 24303…
## $ total_sales_Q4 <dbl> 22513142, 21969895, 6245587, 29673876, 27359644, 758151…
## $ avg_sales_Q1 <dbl> 1522329.8, 1515609.5, 1594135.6, 1992764.2, 1874185.0, …
## $ avg_sales_Q2 <dbl> 1495140.2, 1539425.2, 1613750.8, 1951331.1, 1838919.6, …
## $ avg_sales_Q3 <dbl> 1473094.6, 1482298.7, 1557996.0, 1868813.2, 1775147.3, …
## $ avg_sales_Q4 <dbl> 1608081.6, 1689991.9, 1561396.8, 2119562.5, 2104588.0, …
## $ max_sales_Q1 <dbl> 1643690.9, 1686842.8, 1819870.0, 2137809.5, 2168041.6, …
## $ max_sales_Q2 <dbl> 1615524.7, 1635078.4, 1899676.9, 2102539.9, 1953772.0, …
## $ max_sales_Q3 <dbl> 1605491.8, 1624383.8, 1769854.2, 2003940.6, 1876704.3, …
## $ max_sales_Q4 <dbl> 2387950.2, 2270189.0, 1670786.0, 3436007.7, 3224369.8, …
## $ min_sales_Q1 <dbl> 1404429.9, 1316899.3, 1319325.6, 1750197.8, 1695371.7, …
## $ min_sales_Q2 <dbl> 1391256.1, 1428218.3, 1468928.4, 1802450.3, 1688281.9, …
## $ min_sales_Q3 <dbl> 1351791.0, 1352219.8, 1437059.3, 1724557.2, 1650394.4, …
## $ min_sales_Q4 <dbl> 1345454.0, 1445249.1, 1493659.7, 1737947.6, 1743882.2, …
# Create quarterly_sales_long first
quarterly_sales_long <- store_performance_wide %>%
select(store, year, contains("total_sales")) %>%
pivot_longer(
cols = contains("total_sales"),
names_to = "quarter",
values_to = "sales",
names_prefix = "total_sales_Q"
) %>%
mutate(quarter = paste0("Q", quarter))
# Better way to find top stores (using total sales across all quarters)
top_stores <- quarterly_sales_long %>%
group_by(store) %>%
summarise(total_sales = sum(sales, na.rm = TRUE)) %>%
slice_max(total_sales, n = 8) %>%
pull(store)
# Create the plot
quarterly_trends <- quarterly_sales_long %>%
filter(store %in% top_stores)
ggplot(quarterly_trends, aes(x = quarter, y = sales, group = store, color = factor(store))) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
facet_wrap(~ year) +
labs(
title = "Quarterly Sales Trends - Top 8 Performing Stores",
x = "Quarter",
y = "Quarterly Sales",
color = "Store ID"
) +
theme_minimal() +
scale_y_continuous(labels = scales::dollar)
Q4 Performance: 2010 and 2011 showed peak sales in Q4, while 2012 Q4 dropped to the lowest level.
2011 Trend: consistent quarter-over-quarter sales growth throughout the year.
2010 Pattern: Q1 started low, sales increased in Q2, and peaked in Q4.
2012 Anomaly: moderate sales in Q1-Q3 followed by a sharp Q4 decline.
To assess data completeness, we need to verify the number of weeks recorded per store each year. The observed sales decline in 2012 Q4 may stem from incomplete data coverage rather than actual performance issues.
weeks_per_store_year <- walmart_clean %>%
group_by(store, year) %>%
summarise(
total_weeks = n(),
unique_weeks = n_distinct(week),
.groups = 'drop'
)
# View the results
view(weeks_per_store_year)
print(weeks_per_store_year, n = 20)
## # A tibble: 135 × 4
## store year total_weeks unique_weeks
## <int> <dbl> <int> <int>
## 1 1 2010 48 48
## 2 1 2011 52 52
## 3 1 2012 43 43
## 4 2 2010 48 48
## 5 2 2011 52 52
## 6 2 2012 43 43
## 7 3 2010 48 48
## 8 3 2011 52 52
## 9 3 2012 43 43
## 10 4 2010 48 48
## 11 4 2011 52 52
## 12 4 2012 43 43
## 13 5 2010 48 48
## 14 5 2011 52 52
## 15 5 2012 43 43
## 16 6 2010 48 48
## 17 6 2011 52 52
## 18 6 2012 43 43
## 19 7 2010 48 48
## 20 7 2011 52 52
## # ℹ 115 more rows
Data coverage analysis reveals the sales decline in 2012 Q4 is attributable to incomplete data rather than performance issues. While 2011 has complete 52-week records, 2012 contains only 43 weeks of data, creating artificial declines in the quarterly analysis.
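One way to guard against this artifact is to restrict each year to the calendar weeks that appear in every year before comparing annual totals. A minimal sketch, assuming the walmart_clean data frame built above:
# Sketch: keep only the calendar weeks present in all three years,
# then recompute annual totals on a like-for-like basis
common_weeks <- walmart_clean %>%
  distinct(year, week) %>%
  count(week) %>%
  filter(n == n_distinct(walmart_clean$year)) %>%
  pull(week)
walmart_clean %>%
  filter(week %in% common_weeks) %>%
  group_by(year) %>%
  summarise(total_sales = sum(weekly_sales), weeks_covered = n_distinct(week))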
# Create multiple transformed views of the data
store_metrics <- walmart_clean %>%
group_by(store, year) %>%
summarise(
#Total annual sales for each store-year combination
total_revenue = sum(weekly_sales),
#Average weekly sales performance
#Helps compare stores of different sizes on a normalized basis
avg_weekly_sales = mean(weekly_sales),
#Standard deviation of weekly sales
#Measures how consistent or unpredictable sales are throughout the year
sales_volatility = sd(weekly_sales),
#Compares average sales during holiday weeks vs non-holiday weeks
holiday_boost = mean(weekly_sales[holiday_flag == "holiday"]) /
mean(weekly_sales[holiday_flag == "non_holiday"]),
best_month = month[which.max(weekly_sales)][1],
worst_month = month[which.min(weekly_sales)][1],
observation_count = n(),
.groups = 'drop'
) %>%
mutate(
#Divides all stores into 4 equal groups (quartiles) based on avg_weekly_sales
#Each store gets a number 1-4 representing which quartile they belong to
performance_tier = ntile(avg_weekly_sales, 4),
performance_label = case_when(
performance_tier == 1 ~ "Low",
performance_tier == 2 ~ "Medium",
performance_tier == 3 ~ "High",
performance_tier == 4 ~ "Elite"
)
)
view(store_metrics)
Holiday performance ratio (holiday_boost):
Values > 1 mean a store performs better during holiday weeks; values < 1 mean it performs worse. High sales_volatility indicates unpredictable sales patterns. The sketch below pulls out the store-year combinations with a positive holiday lift.
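A minimal sketch, assuming the store_metrics table created above (the threshold of 1 is simply the break-even ratio):
# Sketch: store-year combinations where holiday weeks outperform regular weeks
store_metrics %>%
  filter(holiday_boost > 1) %>%
  arrange(desc(holiday_boost)) %>%
  select(store, year, holiday_boost, sales_volatility, performance_label) %>%
  head(10)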
# Load the zoo package for rolling mean calculations
library('zoo')
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Create time series features for sales analysis
time_series_data <- walmart_clean %>%
# Sort data chronologically for each store to ensure proper time sequence
arrange(store, year, month, week) %>%
# Group by store to calculate metrics within each store separately
group_by(store) %>%
# Create new time-based metrics for each store
mutate(
# Get previous week's sales for comparison (1-week lag)
sales_lag1 = lag(weekly_sales, 1),
# Calculate week-over-week sales growth percentage
sales_growth = (weekly_sales / sales_lag1 - 1) * 100,
# Compute 4-week moving average to smooth out weekly fluctuations
rolling_avg_4wk = zoo::rollmean(weekly_sales, 4, fill = NA, align = 'right')
) %>%
# Remove grouping to return to normal dataframe structure
ungroup()
# Select only the key columns for analysis and reporting
selected_data <- time_series_data %>%
select(store, year, month, week, weekly_sales, sales_lag1, sales_growth, rolling_avg_4wk)
# Open interactive data viewer to explore the transformed dataset
view(selected_data)
# Display first few rows in console for quick inspection
head(selected_data)
## # A tibble: 6 × 8
## store year month week weekly_sales sales_lag1 sales_growth rolling_avg_4wk
## <int> <dbl> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2010 Feb 6 1643691. NA NA NA
## 2 1 2010 Feb 7 1641957. 1643691. -0.105 NA
## 3 1 2010 Feb 8 1611968. 1641957. -1.83 NA
## 4 1 2010 Feb 9 1409728. 1611968. -12.5 1576836.
## 5 1 2010 Mar 10 1554807. 1409728. 10.3 1554615.
## 6 1 2010 Mar 11 1439542. 1554807. -7.41 1504011.
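To sanity-check the lag and rolling-average columns visually, a quick plot can overlay weekly sales and the 4-week moving average. This is an illustrative sketch using the time_series_data frame above; Store 1 is an arbitrary choice:
# Sketch: weekly sales vs. 4-week rolling average for a single store
time_series_data %>%
  filter(store == 1) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = weekly_sales), color = "grey60") +
  geom_line(aes(y = rolling_avg_4wk), color = "steelblue", na.rm = TRUE) +
  labs(title = "Store 1: Weekly Sales vs 4-Week Rolling Average",
       x = "Week", y = "Weekly Sales") +
  theme_minimal() +
  scale_y_continuous(labels = scales::dollar)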
# monthly analysis (aggregate weekly data into monthly totals first)
monthly_analysis <- walmart_clean %>%
# Group by store and month to create monthly aggregates
group_by(store, year, month) %>%
# Calculate monthly totals from weekly data
summarise(
monthly_sales = sum(weekly_sales),
.groups = 'drop'
) %>%
# Sort chronologically
arrange(store, year, month) %>%
# Group by store for time series calculations
group_by(store) %>%
mutate(
# Get previous month's total sales
monthly_sales_lag1 = lag(monthly_sales, 1),
# Calculate month-over-month growth
monthly_growth = (monthly_sales / monthly_sales_lag1 - 1) * 100,
# Compute 3-month rolling average of monthly totals
rolling_avg_3month = zoo::rollmean(monthly_sales, 3, fill = NA, align = 'right')
) %>%
ungroup()
# View monthly analysis
view(monthly_analysis)
head(monthly_analysis)
## # A tibble: 6 × 7
## store year month monthly_sales monthly_sales_lag1 monthly_growth
## <int> <dbl> <ord> <dbl> <dbl> <dbl>
## 1 1 2010 Feb 6307344. NA NA
## 2 1 2010 Mar 5871294. 6307344. -6.91
## 3 1 2010 Apr 7422802. 5871294. 26.4
## 4 1 2010 May 5929939. 7422802. -20.1
## 5 1 2010 Jun 6084081. 5929939. 2.60
## 6 1 2010 Jul 7244483. 6084081. 19.1
## # ℹ 1 more variable: rolling_avg_3month <dbl>
holiday_analysis <- walmart_clean %>%
group_by(store, holiday_flag) %>%
summarise(
avg_sales = mean(weekly_sales),
median_sales = median(weekly_sales),
sales_count = n(),
.groups = 'drop'
) %>%
group_by(store) %>%
arrange(store, holiday_flag) %>% # Sort by store and holiday_flag (non_holiday, then holiday)
mutate(
holiday_premium = (avg_sales / lag(avg_sales) - 1) * 100
# Calculates percentage difference:
# For holiday_flag=1 row: (holiday_sales / previous_non_holiday_sales - 1) * 100
# Shows how much higher/lower holiday sales are compared to non-holiday sales
) %>%
filter(!is.na(holiday_premium)) # Keep only rows where calculation worked (holiday_flag=1 rows)
view(holiday_analysis)
The final filter keeps only the holiday rows, which carry the calculated premium. Example: if non-holiday average sales = $100 and holiday average sales = $125, then
(125 / 100 - 1) * 100 = 25% holiday premium
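The same arithmetic can be checked directly, and the stores with the largest holiday lift listed. A quick sketch, assuming the holiday_analysis table above:
# Sketch: verify the example arithmetic and list the top holiday premiums
(125 / 100 - 1) * 100   # 25% premium
holiday_analysis %>%
  arrange(desc(holiday_premium)) %>%
  select(store, avg_sales, holiday_premium) %>%
  head(5)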
# Analyze external factors
external_factors_analysis <- walmart_clean %>%
group_by(store, temp_segment) %>%
summarise(
avg_sales = mean(weekly_sales),
avg_temperature = mean(temperature),
avg_fuel_price = mean(fuel_price),
avg_cpi = mean(cpi),
avg_unemployment = mean(unemployment),
observation_count = n(),
.groups = 'drop'
) %>%
arrange(store, avg_temperature)
external_factors_analysis
## # A tibble: 131 × 8
## store temp_segment avg_sales avg_temperature avg_fuel_price avg_cpi
## <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 Cold 1600360. 49.9 3.04 215.
## 2 1 Moderate 1535739. 75.6 3.29 216.
## 3 1 Hot 1574765. 91.2 3.66 216.
## 4 2 Cold 2043379. 49.0 3.05 215.
## 5 2 Moderate 1875919. 76.0 3.28 216.
## 6 2 Hot 1827482. 91.1 3.59 217.
## 7 3 Cold 431030. 52.3 3.00 218.
## 8 3 Moderate 394864. 76.7 3.28 220.
## 9 4 Very Cold 2188307. 28.8 2.57 126.
## 10 4 Cold 2160063. 45.9 3.09 128.
## # ℹ 121 more rows
## # ℹ 2 more variables: avg_unemployment <dbl>, observation_count <int>
# Correlation analysis between variables
correlation_analysis <- walmart_clean %>%
group_by(store) %>%
summarise(
cor_temp_sales = cor(temperature, weekly_sales),
cor_fuel_sales = cor(fuel_price, weekly_sales),
cor_cpi_sales = cor(cpi, weekly_sales),
cor_unemp_sales = cor(unemployment, weekly_sales),
.groups = 'drop'
)
view(correlation_analysis)
# Convert to long format for heatmap
correlation_long <- correlation_analysis %>%
pivot_longer(
cols = -store,
names_to = "variable",
values_to = "correlation"
) %>%
mutate(
variable = case_when(
variable == "cor_temp_sales" ~ "Temperature",
variable == "cor_fuel_sales" ~ "Fuel Price",
variable == "cor_cpi_sales" ~ "CPI",
variable == "cor_unemp_sales" ~ "Unemployment",
TRUE ~ variable
)
)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
# First, ensure variable names are properly formatted
significant_correlations <- correlation_long %>%
filter(abs(correlation) > 0.3) %>%
mutate(
variable_clean = case_when(
variable == "cor_temp_sales" ~ "Temperature",
variable == "cor_fuel_sales" ~ "Fuel Price",
variable == "cor_cpi_sales" ~ "CPI",
variable == "cor_unemp_sales" ~ "Unemployment",
TRUE ~ variable
),
correlation_label = round(correlation, 3),
significance_level = case_when(
abs(correlation) > 0.7 ~ "Very Strong",
abs(correlation) > 0.5 ~ "Strong",
abs(correlation) > 0.3 ~ "Moderate",
TRUE ~ "Weak"
)
)
# Create interactive heatmap with proper x-axis
plot_ly(
data = significant_correlations,
x = ~variable_clean, # Use cleaned variable names
y = ~as.factor(store),
z = ~correlation,
type = "heatmap",
width = 800,   # figure size set here; width/height in layout() is deprecated
height = 600,
colors = colorRamp(c("darkred", "white", "darkblue")),
hoverinfo = "text",
text = ~paste(
"Store:", store,
"<br>Variable:", variable_clean,
"<br>Correlation:", correlation_label,
"<br>Significance:", significance_level
),
colorbar = list(
title = "Correlation",
tickvals = c(-1, -0.5, 0, 0.5, 1),
ticktext = c("-1.0", "-0.5", "0", "0.5", "1.0")
)
) %>%
layout(
title = list(
text = "<b>Significant Correlations Only (|r| > 0.3)</b>",
x = 0.5,
font = list(size = 16)
),
xaxis = list(
title = "Economic Variables",
tickangle = -45,
categoryorder = "array",
categoryarray = c("Temperature", "Fuel Price", "CPI", "Unemployment") # Explicit order
),
yaxis = list(
title = "Store ID"
),
margin = list(l = 80, r = 50, b = 100, t = 80)
)
Based on the correlation heatmap analysis, here are the key findings:
1. Temperature Impact:
Temperature shows a consistent negative correlation with weekly sales across most stores. As temperatures increase, sales tend to decrease (and vice versa), suggesting potential seasonal patterns.
2. CPI (Consumer Price Index) Relationship:
CPI demonstrates predominantly strong positive correlations with sales across the store network. Store 36 stands out as a significant exception, with a highly negative correlation (-0.92). This indicates most stores benefit from economic expansion, while Store 36 shows the inverse behavior.
3. Fuel Price Effects:
Fuel prices exhibit mixed correlations with sales performance. Positive correlations are observed in several stores, suggesting some locations benefit from or are resilient to fuel price changes. Negative correlations appear in key locations: Store 36 (most affected), Store 35, Store 31, and Store 14. This indicates varied customer sensitivity to transportation costs across store locations.
4. Unemployment Rate Influence:
Unemployment rates show primarily negative correlations with sales; higher unemployment typically associates with reduced consumer spending. Notable exceptions with strong positive correlations are Store 36 and Store 35, which appear to perform better in higher-unemployment environments, suggesting unique local market dynamics.
Key Store Insights:
Store 36 demonstrates atypical behavior across multiple variables, warranting further investigation (see the sketch below). Stores show heterogeneous responses to economic factors, highlighting the importance of localized strategies. The mixed correlations suggest customer demographics and local economic conditions significantly influence each store's performance drivers. These findings emphasize the need for store-specific strategies rather than one-size-fits-all approaches to sales optimization.
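As a starting point for that investigation, a quick sketch using the correlation_analysis table above isolates Store 36's correlation profile and lists the stores with the most negative unemployment relationship:
# Sketch: isolate Store 36's correlations for follow-up
correlation_analysis %>%
  filter(store == 36)
# Stores with the most negative unemployment-sales correlation
correlation_analysis %>%
  arrange(cor_unemp_sales) %>%
  head(5)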
# Calculate annual performance metrics for each store
store_year_metrics <- walmart_clean %>%
# Group by store and year to calculate annual statistics
group_by(store, year) %>%
# Compute key performance indicators for each store-year combination
summarise(
# Total annual revenue across all weeks
total_annual_sales = sum(weekly_sales),
# Average weekly sales throughout the year
mean_weekly_sales = mean(weekly_sales),
# Median weekly sales (robust measure of typical performance)
median_weekly_sales = median(weekly_sales),
# Standard deviation of weekly sales (measures volatility/consistency)
sales_volatility = sd(weekly_sales),
# Number of weeks with data in the year
weeks_recorded = n(),
# Average temperature throughout the year
mean_temperature = mean(temperature),
# Average fuel price throughout the year
mean_fuel_price = mean(fuel_price),
# Average consumer price index throughout the year
mean_cpi = mean(cpi),
# Average unemployment rate throughout the year
mean_unemployment = mean(unemployment),
.groups = 'drop'
) %>%
# Sort by store and year for better readability
arrange(store, year)
# View the comprehensive annual metrics in interactive data viewer
view(store_year_metrics)
# Display first few rows in console for quick inspection
head(store_year_metrics)
## # A tibble: 6 × 11
## store year total_annual_sales mean_weekly_sales median_weekly_sales
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 2010 73278832 1526642. 1494365.
## 2 1 2011 80921919. 1556191. 1537167.
## 3 1 2012 68202058. 1586094. 1582083.
## 4 2 2010 95277864. 1984956. 1927937.
## 5 2 2011 98607881. 1896305. 1838128.
## 6 2 2012 81496695. 1895272. 1898777.
## # ℹ 6 more variables: sales_volatility <dbl>, weeks_recorded <int>,
## # mean_temperature <dbl>, mean_fuel_price <dbl>, mean_cpi <dbl>,
## # mean_unemployment <dbl>
library(plotly)
# Line plot showing sales trends
sales_trend <- ggplot(store_year_metrics,
aes(x = year, y = total_annual_sales,
group = store, color = as.factor(store))) +
geom_line(alpha = 0.7) +
geom_point(size = 1) +
labs(title = "Annual Sales Trends by Store (2010-2012)",
x = "Year", y = "Total Annual Sales ($)",
color = "Store ID") +
theme_minimal() +
scale_y_continuous(labels = scales::dollar)
# Interactive version
ggplotly(sales_trend)
1. Declining Performance Stores:
Store 14 demonstrates a concerning downward trajectory, with consistent sales degradation from 2010 through 2012. Store 35 shows a similar declining pattern, indicating potential operational challenges or market-share loss in these locations.
2. Consistently Underperforming Store:
Store 33 maintains the lowest annual sales across all three years (2010-2012), consistently ranking at the bottom of the performance spectrum and suggesting systemic issues requiring immediate intervention (verified in the sketch below).
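A quick verification sketch, based on the store_year_metrics table above:
# Sketch: lowest-grossing store in each year
store_year_metrics %>%
  group_by(year) %>%
  slice_min(total_annual_sales, n = 1) %>%
  ungroup() %>%
  select(year, store, total_annual_sales)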
sales_box <- ggplot(store_year_metrics,
aes(x = as.factor(year), y = total_annual_sales)) +
geom_boxplot(fill = "lightblue", alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
labs(title = "Annual Sales Distribution by Year",
x = "Year", y = "Total Annual Sales ($)") +
theme_minimal() +
scale_y_continuous(labels = scales::dollar)
sales_box
performance_heatmap <- ggplot(store_year_metrics,
aes(x = as.factor(year), y = as.factor(store),
fill = total_annual_sales)) +
geom_tile(color = "white", linewidth = 0.5) +
scale_fill_viridis_c(
labels = scales::dollar,
name = "Annual Sales",
option = "plasma" # Better color contrast
) +
labs(
title = "Walmart Store Performance Heatmap (2010-2012)",
subtitle = "Annual Sales Distribution Across 45 Stores",
x = "Year",
y = "Store ID"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, size = 10),
axis.text.y = element_text(size = 8), # Smaller font for store labels
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
plot.subtitle = element_text(hjust = 0.5, size = 11),
panel.grid = element_blank(),
legend.position = "right",
plot.margin = margin(0.0000005, .0000005, .000005, 0.000005, "cm") # Effectively zero margins to maximize plotting area
) +
# Add value labels on tiles for better readability
geom_text(aes(label = scales::dollar(total_annual_sales, scale = 1e-6, suffix = "M")),
size = 2, color = "white", fontface = "bold")
# Display with increased dimensions
performance_heatmap
# Save with specific dimensions
ggsave("store_performance_heatmap.png", performance_heatmap,
width = 30, height = 29, units = "in", dpi = 500)
#### 9.4 Average vs Standard Deviation for Sales
volatility_plot <- ggplot(store_year_metrics,
aes(x = mean_weekly_sales, y = sales_volatility,
size = total_annual_sales, color = as.factor(store))) +
geom_point(alpha = 0.7) +
labs(title = "Sales Consistency Analysis: Average vs Volatility",
x = "Mean Weekly Sales ($)",
y = "Sales Volatility (Standard Deviation)",
size = "Total Annual Sales", color = "Store ID") +
theme_minimal() +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar) +
guides(color = "none") # Remove store color legend for clarity
# Interactive version
ggplotly(volatility_plot)
#Calculate year-over-year growth for each store
store_growth_analysis <- store_year_metrics %>%
# Group by store to calculate growth within each store
group_by(store) %>%
# Sort by year to ensure proper sequence
arrange(year) %>%
mutate(
# Calculate year-over-year sales growth percentage
yoy_sales_growth = (total_annual_sales / lag(total_annual_sales) - 1) * 100,
# Calculate year-over-year weekly sales growth
yoy_weekly_growth = (mean_weekly_sales / lag(mean_weekly_sales) - 1) * 100
) %>%
ungroup()
store_growth_analysis_data <- store_growth_analysis %>%
select(store, year, yoy_sales_growth,yoy_weekly_growth)%>%
filter(year %in% c(2011, 2012))
# View the growth analysis
view(store_growth_analysis_data)
#Year-over-Year Sales Growth by Store
growth_plot <- ggplot(store_growth_analysis_data,
aes(x = as.factor(store), y = yoy_sales_growth, fill = yoy_sales_growth > 0)) +
geom_col() +
labs(title = "Year-over-Year Sales Growth by Store",
x = "Store ID", y = "Year-over-Year Growth (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_manual(values = c("TRUE" = "darkgreen", "FALSE" = "darkgrey"),
name = "Positive Growth")
growth_plot
### Year-over-Year Sales Growth Analysis
1. Significant Sales Decline Stores:
Stores 35 and 36 experienced drastic sales deterioration, with consecutive year-over-year declines from 2010 through 2012 (see the verification sketch after this list). Stores 13 and 14 showed a sharp sales reduction in 2012, contributing substantially to the overall annual performance drop.
2. Positive Growth Performers:
Store 7 demonstrated notable sales improvement from 2010 to 2011, indicating successful growth strategies. Store 38 achieved consistent sales growth from 2010 to 2011, showing positive momentum.
3. Overall Performance Context:
The pronounced declines in Stores 13, 14, 35, and 36 directly contributed to 2012's lower aggregate sales observed in the box plot analysis.
This comprehensive analysis of Walmart’s sales data from 2010-2012 across 45 stores reveals significant insights into retail performance drivers, seasonal patterns, and economic influences. The study demonstrates that sales performance is multifaceted, influenced by both internal operational factors and external economic conditions.
Peak Performance: 2011 represented the strongest sales year across the store network.
Concerning Decline: 2012 showed significant sales degradation, with a 7.6% average decrease from 2011 levels.
Store Variability: Performance varied dramatically across locations, with top-performing stores generating 3-4x more revenue than bottom performers.
Persistent Issues: Stores 33, 14, 35, and 36 demonstrated consistent performance challenges.
Store 33: consistently ranked lowest in annual sales across all three years.
Stores 14 & 35: showed progressive sales deterioration from 2010-2012.
Store 36: exhibited atypical correlation patterns with economic indicators.
Temperature: generally negative correlation with sales (roughly -0.2 to -0.4), suggesting seasonal shopping patterns.
CPI: mixed impacts with predominantly positive correlations, except Store 36, which shows a strong negative correlation (-0.92).
Fuel Prices: varied effects across stores, indicating regional sensitivity differences.
Unemployment: primarily negative correlations, though Stores 35 and 36 showed counter-intuitive positive relationships.
Holiday Premium: average 15-25% sales increase during holiday weeks across most stores.
Temperature Segments: "Cold" weather periods (30-60°F) showed the highest sales consistency.
Quarterly Patterns: Q4 was consistently the strongest performer due to the holiday season.
Temporal Scope: limited to 2010-2012 data, missing recent trends and COVID-era impacts.
Geographic Granularity: store-level analysis without regional clustering.
External Factors: limited incorporation of local competition, demographic changes, or marketing initiatives.
Data Completeness: 2011 has complete 52-week records, but 2012 contains only 43 weeks of data, so the apparent 2012 Q4 decline reflects incomplete coverage rather than performance issues.
Machine Learning: implement predictive models for sales forecasting (a baseline sketch follows below).
Advanced Visualization: develop interactive dashboards.
Geospatial Analysis: map store performance against regional economic indicators.
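As a first, deliberately simple step toward the forecasting idea, the sketch below fits a linear model relating weekly sales to the economic drivers explored in this analysis. This is a minimal illustration, not a production model: it assumes the walmart_clean data frame above and ignores store effects and autocorrelation.
# Sketch: baseline linear model of weekly sales on economic factors
sales_model <- lm(weekly_sales ~ temperature + fuel_price + cpi + unemployment + holiday_flag,
                  data = walmart_clean)
summary(sales_model)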
This analysis demonstrates that Walmart’s sales performance is influenced by a complex interplay of store-specific factors, economic conditions, and seasonal patterns. The findings highlight both the resilience of the retail operation and significant opportunities for targeted improvements. By leveraging these insights and pursuing the recommended future work, Walmart can enhance its strategic decision-making, optimize store performance, and maintain competitive advantage in the evolving retail landscape.
The most significant opportunity lies in moving from reactive analysis to predictive, prescriptive analytics that can anticipate performance challenges and recommend specific interventions before they impact revenue.