This project applies Basic R Programming concepts to Walmart sales data for statistical, business, and predictive analysis. It includes data cleaning, outlier removal, descriptive statistics, visualization, correlation, and regression to understand sales patterns and business performance. The analysis helps transform raw retail data into meaningful insights using practical R functions.
getwd()
## [1] "/Users/trishita/Downloads"
setwd("~/Downloads")
data <- read.csv("Walmart.csv", stringsAsFactors = FALSE)
nrow(data)
## [1] 6435
ncol(data)
## [1] 8
str(data)
## 'data.frame': 6435 obs. of 8 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
## $ Weekly_Sales: num 1643691 1641957 1611968 1409728 1554807 ...
## $ Holiday_Flag: int 0 1 0 0 0 0 0 0 0 0 ...
## $ Temperature : num 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num 2.57 2.55 2.51 2.56 2.62 ...
## $ CPI : num 211 211 211 211 211 ...
## $ Unemployment: num 8.11 8.11 8.11 8.11 8.11 ...
data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
str(data$Date)
## Date[1:6435], format: "2010-02-05" "2010-02-12" "2010-02-19" "2010-02-26" "2010-03-05" ...
summary(data$Date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2010-02-05" "2010-10-08" "2011-06-17" "2011-06-17" "2012-02-24" "2012-10-26"
Interpretation: The Date column is successfully converted from character to Date type. The data spans from 5 February 2010 to 26 October 2012, roughly 2.7 years of weekly sales records.
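A failed parse does not raise an error: as.Date() quietly returns NA for strings that do not match the format. Below is a minimal sketch of a check for that, using made-up date strings rather than the Walmart file (the names raw_dates, parsed, and n_failed are illustrative only):

```r
# Made-up date strings; the last one deliberately uses a different layout
raw_dates <- c("05-02-2010", "12-02-2010", "19-02-2010", "2010/02/26")

# Parse with the same day-month-year format used above
parsed <- as.Date(raw_dates, format = "%d-%m-%Y")

# Count strings that were present but could not be parsed
n_failed <- sum(is.na(parsed) & !is.na(raw_dates))
cat("Unparsed dates:", n_failed, "\n")
```

On the real data the same count could be taken right after the conversion, before any date-based columns are derived.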
colSums(is.na(data))
## Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price
## 0 0 0 0 0 0
## CPI Unemployment
## 0 0
Interpretation: There are no missing values in any column of the dataset, meaning the data is complete and no imputation or row removal is required before analysis.
n_before <- nrow(data)
data <- unique(data)
n_after <- nrow(data)
cat("Before:", n_before, "\n")
## Before: 6435
cat("After :", n_after, "\n")
## After : 6435
cat("Removed:", n_before - n_after, "duplicate(s)\n")
## Removed: 0 duplicate(s)
Interpretation: No fully duplicated rows were found in the dataset, so no row needs to be removed. Note that unique() compares entire rows; on its own it does not confirm that each Store and Date pair appears only once.
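To check the store-week key directly, one option is duplicated() on just the Store and Date columns. This is a sketch on a made-up data frame (toy, n_full_dups, and n_key_dups are hypothetical names), not output from the Walmart file:

```r
# Made-up records: rows 3 and 4 share Store and Date but differ in sales
toy <- data.frame(
  Store = c(1, 1, 2, 2),
  Date = as.Date(c("2010-02-05", "2010-02-12", "2010-02-05", "2010-02-05")),
  Weekly_Sales = c(100, 110, 90, 95)
)

# Whole-row duplicates: none, because the sales values differ
n_full_dups <- sum(duplicated(toy))

# Duplicates on the (Store, Date) key: one repeated store-week
n_key_dups <- sum(duplicated(toy[, c("Store", "Date")]))

cat("Full-row duplicates:", n_full_dups, "\n")
cat("Store-Date duplicates:", n_key_dups, "\n")
```

The toy example shows why the two checks can disagree: a repeated store-week with a different sales figure passes the whole-row test but fails the key test.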
data$Holiday_Flag <- as.factor(data$Holiday_Flag)
data$Month <- as.numeric(format(data$Date, "%m"))
data$Year <- as.numeric(format(data$Date, "%Y"))
data$Sales_Category <- ifelse(data$Weekly_Sales > mean(data$Weekly_Sales), "High", "Low")
data$Bonus_Sales <- data$Weekly_Sales * 0.10
str(data)
## 'data.frame': 6435 obs. of 12 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2010-02-05" "2010-02-12" ...
## $ Weekly_Sales : num 1643691 1641957 1611968 1409728 1554807 ...
## $ Holiday_Flag : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
## $ Temperature : num 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num 2.57 2.55 2.51 2.56 2.62 ...
## $ CPI : num 211 211 211 211 211 ...
## $ Unemployment : num 8.11 8.11 8.11 8.11 8.11 ...
## $ Month : num 2 2 2 2 3 3 3 3 4 4 ...
## $ Year : num 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Sales_Category: chr "High" "High" "High" "High" ...
## $ Bonus_Sales : num 164369 164196 161197 140973 155481 ...
Interpretation: We fixed the data types and added 4 new columns that will help us analyse sales patterns by time, performance category, and bonus calculations.
summary(data)
## Store Date Weekly_Sales Holiday_Flag
## Min. : 1 Min. :2010-02-05 Min. : 209986 0:5985
## 1st Qu.:12 1st Qu.:2010-10-08 1st Qu.: 553350 1: 450
## Median :23 Median :2011-06-17 Median : 960746
## Mean :23 Mean :2011-06-17 Mean :1046965
## 3rd Qu.:34 3rd Qu.:2012-02-24 3rd Qu.:1420159
## Max. :45 Max. :2012-10-26 Max. :3818686
## Temperature Fuel_Price CPI Unemployment
## Min. : -2.06 Min. :2.472 Min. :126.1 Min. : 3.879
## 1st Qu.: 47.46 1st Qu.:2.933 1st Qu.:131.7 1st Qu.: 6.891
## Median : 62.67 Median :3.445 Median :182.6 Median : 7.874
## Mean : 60.66 Mean :3.359 Mean :171.6 Mean : 7.999
## 3rd Qu.: 74.94 3rd Qu.:3.735 3rd Qu.:212.7 3rd Qu.: 8.622
## Max. :100.14 Max. :4.468 Max. :227.2 Max. :14.313
## Month Year Sales_Category Bonus_Sales
## Min. : 1.000 Min. :2010 Length :6435 Min. : 20999
## 1st Qu.: 4.000 1st Qu.:2010 N.unique : 2 1st Qu.: 55335
## Median : 6.000 Median :2011 N.blank : 0 Median : 96075
## Mean : 6.448 Mean :2011 Min.nchar: 3 Mean :104696
## 3rd Qu.: 9.000 3rd Qu.:2012 Max.nchar: 4 3rd Qu.:142016
## Max. :12.000 Max. :2012 Max. :381869
Interpretation: The summary gives us a snapshot of every column. The maximum of Weekly_Sales (about $3.8 million) is far above the third quartile (about $1.4 million), so the next step is to check for outliers.

## Outlier Detection

### Q: Are there any extreme values present in Weekly Sales?
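# Load plotting packages (assumed to be installed): ggplot2 for ggplot(), scales for comma
library(ggplot2)
library(scales)
ggplot(data, aes(y = Weekly_Sales)) +
  geom_boxplot(fill = "orange", color = "black") +
  scale_y_continuous(labels = comma) +
  labs(title = "Boxplot of Weekly Sales (Before Outlier Removal)",
       y = "Weekly Sales")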
Interpretation: In this boxplot, the dots floating above the box are outliers: unusually high sales weeks that do not represent typical store performance. They are removed in the next step so that they do not distort the summary statistics.
Q1 <- quantile(data$Weekly_Sales, 0.25)
Q3 <- quantile(data$Weekly_Sales, 0.75)
IQRV <- IQR(data$Weekly_Sales)
lower <- Q1 - 1.5 * IQRV
upper <- Q3 + 1.5 * IQRV
Q1
## 25%
## 553350.1
Q3
## 75%
## 1420159
IQRV
## [1] 866808.6
lower
## 25%
## -746862.7
upper
## 75%
## 2720371
data_clean <- data[data$Weekly_Sales >= lower & data$Weekly_Sales <= upper, ]
nrow(data)
## [1] 6435
nrow(data_clean)
## [1] 6401
Interpretation: We calculated boundaries from Q1, Q3, and the IQR; any sales value outside them is treated as an outlier and excluded from data_clean. Because the lower bound is negative, only unusually high weeks are affected. A total of 34 outlier rows were removed (6,435 - 6,401), leaving 6,401 clean records.
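The same 1.5 x IQR rule can be wrapped in a small helper. The sketch below runs on a made-up vector (iqr_fences and toy_sales are hypothetical names) and then repeats the row arithmetic from the nrow() output above:

```r
# Return the lower and upper fences (Q1 - k*IQR, Q3 + k*IQR) for a numeric vector
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Made-up sales values with one obvious extreme value
toy_sales <- c(10, 12, 13, 15, 16, 18, 20, 95)
fences <- iqr_fences(toy_sales)

# Keep only values inside the fences
kept <- toy_sales[toy_sales >= fences["lower"] & toy_sales <= fences["upper"]]
print(fences)
print(kept)

# Row counts reported by nrow() before and after filtering the Walmart data
rows_removed <- 6435 - 6401
cat("Rows removed:", rows_removed, "\n")
```

quantile() is used with its default settings here, the same as in the analysis, so the fences follow the same definition of Q1 and Q3.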
ggplot(data_clean, aes(y = Weekly_Sales)) +
geom_boxplot(fill = "yellow", color = "black") +
scale_y_continuous(labels = comma) +
labs(title = "Boxplot of Weekly Sales (After Outlier Removal)",
y = "Weekly Sales")
Interpretation: Compared to the previous boxplot, the dots above are gone — the data now represents typical weekly sales performance without extreme spikes pulling our results.
## Descriptive Statistics

### Q: What are the central tendency and spread measures of Weekly Sales?
mean(data_clean$Weekly_Sales)
## [1] 1036130
median(data_clean$Weekly_Sales)
## [1] 957298.3
sd(data_clean$Weekly_Sales)
## [1] 545196.1
var(data_clean$Weekly_Sales)
## [1] 297238739401
Interpretation: On average a store makes about $1 million per week, but there is a lot of variation — some stores make much more and some much less than the average.
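One way to put a number on "a lot of variation" is the coefficient of variation (standard deviation divided by the mean). The sketch below only reuses the figures printed above; cv and the other variable names are illustrative:

```r
# Values copied from the output above (cleaned data)
mean_sales <- 1036130
sd_sales <- 545196.1
var_sales <- 297238739401

# Coefficient of variation: spread relative to the mean
cv <- sd_sales / mean_sales
cat("Coefficient of variation:", round(cv, 3), "\n")

# Sanity check: the variance is the square of the standard deviation
# (the ratio is close to 1, up to rounding in the printed sd)
cat("sd^2 / var:", sd_sales^2 / var_sales, "\n")
```

A coefficient of variation of roughly 0.53 means the typical deviation from the mean is about half the mean itself.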
quantile(data_clean$Weekly_Sales)
## 0% 25% 50% 75% 100%
## 209986.2 551743.1 957298.3 1414564.5 2685351.8
IQR(data_clean$Weekly_Sales)
## [1] 862821.5
Interpretation: Quartiles split the data into four equal-sized groups. The middle 50% of store-weeks fall between about $552K and $1.41M in weekly sales, which we can treat as our “normal” range of performance.
min(data_clean$Weekly_Sales)
## [1] 209986.2
max(data_clean$Weekly_Sales)
## [1] 2685352
Interpretation: Even the weakest store-week brings in about $210K, while the strongest one left after outlier removal reaches nearly $2.7 million. That is almost 13 times more, showing how differently stores perform.
table(data_clean$Sales_Category)
##
## High Low
## 2842 3559
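Interpretation: About 56% of store-weeks (3,559 of 6,401) fall below the average and about 44% are above it, which is consistent with a right-skewed distribution where the mean sits above the median. Note that the High/Low cut-off is the mean of the full dataset, computed before outlier removal.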
store_avg <- aggregate(Weekly_Sales ~ Store, data = data_clean, mean)
head(store_avg, 10)
## Store Weekly_Sales
## 1 1 1555264.4
## 2 2 1905830.2
## 3 3 402704.4
## 4 4 2051352.0
## 5 5 318011.8
## 6 6 1556539.1
## 7 7 570617.3
## 8 8 908749.5
## 9 9 543980.6
## 10 10 1852745.5
Interpretation: Not all Walmart stores perform the same — some stores consistently sell much more than others every single week, which could be due to location, store size, or local population density.
avg_sales <- aggregate(Weekly_Sales ~ Store, data = data_clean, mean)
top_stores <- avg_sales[order(-avg_sales$Weekly_Sales), ]
head(top_stores)
## Store Weekly_Sales
## 20 20 2058998
## 4 4 2051352
## 14 14 1986529
## 13 13 1957682
## 2 2 1905830
## 10 10 1852745
Interpretation: Store 20 is the best performing Walmart store in this dataset, averaging about $2.06 million per week, narrowly ahead of Store 4 at about $2.05 million. That is more than six times the average of a low performer such as Store 5 (about $318K).
aggregate(Weekly_Sales ~ Holiday_Flag, data = data_clean, mean)
## Holiday_Flag Weekly_Sales
## 1 0 1032370
## 2 1 1086950
Interpretation: Stores sell slightly more during holiday weeks: about $55,000 more on average, or roughly 5%. The difference exists but is not as dramatic as one might expect, suggesting other factors like store size and location matter more than holidays alone.
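The holiday uplift can be worked out directly from the two group means printed above. This is plain arithmetic on the reported values; the variable names are illustrative:

```r
# Group means copied from the aggregate() output above
mean_non_holiday <- 1032370
mean_holiday <- 1086950

# Absolute and relative difference between holiday and non-holiday weeks
uplift <- mean_holiday - mean_non_holiday
uplift_pct <- 100 * uplift / mean_non_holiday

cat("Holiday uplift:", uplift, "\n")
cat("Holiday uplift (%):", round(uplift_pct, 2), "\n")
```

Keep in mind that the most extreme weeks were already removed as outliers, so this comparison is based on the cleaned data only.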
aggregate(Weekly_Sales ~ Store, data = data_clean, max)
## Store Weekly_Sales
## 1 1 2387950.2
## 2 2 2658725.3
## 3 3 605990.4
## 4 4 2508955.2
## 5 5 507900.1
## 6 6 2644633.0
## 7 7 1059715.3
## 8 8 1511641.1
## 9 9 905324.7
## 10 10 2555031.2
## 11 11 2306265.4
## 12 12 1768249.9
## 13 13 2462779.1
## 14 14 2685351.8
## 15 15 1368318.2
## 16 16 1004730.7
## 17 17 1309226.8
## 18 18 2027507.1
## 19 19 2678206.4
## 20 20 2565259.9
## 21 21 1587257.8
## 22 22 1962445.0
## 23 23 2587953.3
## 24 24 2386015.8
## 25 25 1295391.2
## 26 26 1573982.5
## 27 27 2627910.8
## 28 28 2026026.4
## 29 29 1130926.8
## 30 30 519354.9
## 31 31 2068943.0
## 32 32 1959527.0
## 33 33 331173.5
## 34 34 1620748.2
## 35 35 1781867.0
## 36 36 489372.0
## 37 37 605791.5
## 38 38 499267.7
## 39 39 2554482.8
## 40 40 1648829.2
## 41 41 2263722.7
## 42 42 674919.4
## 43 43 725043.0
## 44 44 376233.9
## 45 45 1682862.0
aggregate(Weekly_Sales ~ Store, data = data_clean, min)
## Store Weekly_Sales
## 1 1 1316899.3
## 2 2 1650394.4
## 3 3 339597.4
## 4 4 1762539.3
## 5 5 260636.7
## 6 6 1261253.2
## 7 7 372673.6
## 8 8 772539.1
## 9 9 452905.2
## 10 10 1627707.3
## 11 11 1100418.7
## 12 12 802105.5
## 13 13 1633663.1
## 14 14 1479514.7
## 15 15 454183.4
## 16 16 368600.0
## 17 17 635862.6
## 18 18 540922.9
## 19 19 1181204.5
## 20 20 1761016.5
## 21 21 596218.2
## 22 22 774262.3
## 23 23 1016756.1
## 24 24 1057290.4
## 25 25 558794.6
## 26 26 809833.2
## 27 27 1263534.9
## 28 28 1079669.1
## 29 29 395987.2
## 30 30 369722.3
## 31 31 1198071.6
## 32 32 955463.8
## 33 33 209986.2
## 34 34 836717.8
## 35 35 576332.1
## 36 36 270678.0
## 37 37 451327.6
## 38 38 303908.8
## 39 39 1158698.4
## 40 40 764014.8
## 41 41 991941.7
## 42 42 428953.6
## 43 43 505405.8
## 44 44 241937.1
## 45 45 617207.6
Interpretation: Some stores have very consistent sales week to week, while others swing wildly between high and low — this helps identify which stores are stable performers and which ones are unpredictable.
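The two tables above have to be compared by eye. A possible shortcut is to compute the range (maximum minus minimum) per store in one step. The sketch below uses a made-up data frame; toy, sales_range, and Sales_Range are hypothetical names:

```r
# Made-up weekly sales: store 1 is steady, store 2 swings widely
toy <- data.frame(
  Store = rep(c(1, 2), each = 4),
  Weekly_Sales = c(100, 102, 98, 100, 50, 150, 80, 120)
)

# Range of weekly sales per store (max minus min)
sales_range <- aggregate(Weekly_Sales ~ Store, data = toy,
                         FUN = function(x) max(x) - min(x))
names(sales_range)[2] <- "Sales_Range"

# Most volatile stores first
sales_range <- sales_range[order(-sales_range$Sales_Range), ]
print(sales_range)
```

The same call with data = data_clean should give one range per Walmart store; the standard deviation could be swapped in for a measure that is less sensitive to a single extreme week.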
top_sales <- data_clean[order(-data_clean$Weekly_Sales), ]
head(top_sales, 10)
## Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI
## 1954 14 2011-11-25 2685352 1 48.71 3.492 188.3504
## 2621 19 2010-12-24 2678206 0 26.05 3.309 132.7477
## 186 2 2010-11-26 2658725 1 62.98 2.735 211.4063
## 814 6 2011-12-23 2644633 0 49.45 3.112 220.9477
## 3761 27 2010-11-26 2627911 1 46.67 3.186 136.6896
## 1860 14 2010-02-05 2623470 0 27.31 2.784 181.8712
## 238 2 2011-11-25 2614202 1 56.36 3.236 218.1130
## 189 2 2010-12-17 2609167 0 47.55 2.869 211.0645
## 1904 14 2010-12-10 2600519 0 30.54 3.109 182.5520
## 1957 14 2011-12-16 2594363 0 39.93 3.413 188.7979
## Unemployment Month Year Sales_Category Bonus_Sales
## 1954 8.523 11 2011 High 268535.2
## 2621 8.067 12 2010 High 267820.6
## 186 8.163 11 2010 High 265872.5
## 814 6.551 12 2011 High 264463.3
## 3761 8.021 11 2010 High 262791.1
## 1860 8.992 2 2010 High 262347.0
## 238 7.441 11 2011 High 261420.2
## 189 8.163 12 2010 High 260916.7
## 1904 8.724 12 2010 High 260051.9
## 1957 8.523 12 2011 High 259436.3
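Interpretation: Nine of the ten biggest sales weeks left in the cleaned data fall in the Thanksgiving week of late November or in December before Christmas; the only exception is Store 14 on 2010-02-05. Stores 14 and 2 dominate the list (four and three appearances), joined by Stores 19, 6, and 27. The very largest weeks are not shown here because they were removed as outliers.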
ggplot(data_clean, aes(x = Weekly_Sales)) +
geom_histogram(fill = "yellow", color = "black", bins = 30) +
scale_x_continuous(labels = comma) +
labs(title = "Histogram of Weekly Sales (Clean Data)",
x = "Weekly Sales",
y = "Frequency")
Interpretation: Most stores make between $500K and $1.5M per week — very few stores go beyond that range, which is why the bars get shorter as we move to the right of the chart.
ggplot(data_clean, aes(x = Weekly_Sales)) +
geom_density(fill = "pink", alpha = 0.5) +
scale_x_continuous(labels = comma) +
labs(title = "Density Plot of Weekly Sales",
x = "Weekly Sales",
y = "Density")
Interpretation: Unlike a histogram, the density plot gives a smooth curve — the highest point of the curve shows where most of the weekly sales values are concentrated, which is around $800K to $1M.
ggplot(data_clean, aes(x = Temperature, y = Weekly_Sales)) +
geom_point(color = "blue", alpha = 0.3) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
scale_y_continuous(labels = comma) +
labs(title = "Temperature vs Weekly Sales",
x = "Temperature",
y = "Weekly Sales")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: The red line is almost flat — meaning temperature has very little effect on weekly sales. Stores sell roughly the same amount regardless of how hot or cold it is outside.
ggplot(data_clean, aes(x = Date, y = Weekly_Sales)) +
geom_line(color = "red") +
scale_y_continuous(labels = comma) +
labs(title = "Weekly Sales Trend Over Time",
x = "Date",
y = "Weekly Sales")
Interpretation: The sales line goes up and down throughout the years — the big spikes you see are holiday weeks like Christmas where all stores sell significantly more than usual.
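The plotting call above passes every store-week to geom_line(), so each date carries one point per store and the line jumps between stores. One possible way to get a cleaner trend is to sum sales per date first. The sketch below shows only that aggregation step on made-up data (toy and total_by_date are hypothetical names):

```r
# Made-up store-week records: two stores observed on the same three dates
toy <- data.frame(
  Store = rep(c(1, 2), each = 3),
  Date = rep(as.Date(c("2010-02-05", "2010-02-12", "2010-02-19")), times = 2),
  Weekly_Sales = c(100, 120, 90, 200, 260, 210)
)

# Collapse to one row per date by summing sales across stores
total_by_date <- aggregate(Weekly_Sales ~ Date, data = toy, FUN = sum)
print(total_by_date)
```

The resulting data frame has a single value per date, so it could be handed to the same ggplot() call in place of data_clean.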
holiday_sales <- aggregate(Weekly_Sales ~ Holiday_Flag, data = data_clean, mean)
ggplot(holiday_sales, aes(x = factor(Holiday_Flag), y = Weekly_Sales, fill = factor(Holiday_Flag))) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("steelblue", "tomato")) +
scale_y_continuous(labels = comma) +
labs(title = "Average Sales: Holiday vs Non-Holiday",
x = "Holiday (0 = No, 1 = Yes)",
y = "Average Weekly Sales")
Interpretation: The two bars are almost the same height — holidays do push sales up a little but not by a huge amount. Other factors like store size and location have a bigger impact on sales than holidays alone.
ggplot(data_clean, aes(x = factor(Store), y = Weekly_Sales)) +
geom_boxplot(fill = "lightgreen") +
scale_y_continuous(labels = comma) +
labs(title = "Weekly Sales Across Stores",
x = "Store",
y = "Weekly Sales") +
theme(axis.text.x = element_text(angle = 90))
Interpretation: This plot shows all 45 stores side by side. Taller boxes mean more variation in sales, and higher boxes mean better overall performance. It is easy to spot which stores stand out just by looking at the chart.
top10_stores <- head(avg_sales[order(-avg_sales$Weekly_Sales), "Store"], 10)
store_holiday <- aggregate(Weekly_Sales ~ Store + Holiday_Flag,
data = data_clean[data_clean$Store %in% top10_stores, ],
mean)
ggplot(store_holiday, aes(x = factor(Store), y = Weekly_Sales, fill = factor(Holiday_Flag))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("orange", "skyblue"),
labels = c("Non-Holiday", "Holiday")) +
scale_y_continuous(labels = comma) +
labs(title = "Average Weekly Sales by Store and Holiday Flag (Top 10 Stores)",
x = "Store",
y = "Average Weekly Sales",
fill = "Holiday") +
theme(axis.text.x = element_text(angle = 45))
Interpretation: By placing two bars side by side for each store, we can directly compare how much extra sales each store generates during holiday weeks compared to normal weeks.
num_data <- data_clean[, c("Weekly_Sales", "Temperature", "Fuel_Price", "CPI", "Unemployment")]
cor(num_data)
## Weekly_Sales Temperature Fuel_Price CPI Unemployment
## Weekly_Sales 1.00000000 -0.04434018 0.01818929 -0.06961729 -0.10429751
## Temperature -0.04434018 1.00000000 0.14307972 0.17651002 0.09926623
## Fuel_Price 0.01818929 0.14307972 1.00000000 -0.17207799 -0.03546923
## CPI -0.06961729 0.17651002 -0.17207799 1.00000000 -0.30415811
## Unemployment -0.10429751 0.09926623 -0.03546923 -0.30415811 1.00000000
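Interpretation: None of the economic variables strongly predict weekly sales on their own; the largest correlation with Weekly_Sales is Unemployment at about -0.10. The strongest relationship in the matrix is between CPI and Unemployment (about -0.30), a moderate negative association. CPI and Fuel Price are only weakly and negatively correlated (about -0.17), so in this dataset they do not move together closely.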
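Reading a correlation matrix by eye is error-prone, so here is a minimal sketch that picks out the strongest off-diagonal pair programmatically. The matrix is typed in from the cor() output above; corr, off_diag, and the other names are illustrative:

```r
vars <- c("Weekly_Sales", "Temperature", "Fuel_Price", "CPI", "Unemployment")

# Correlation matrix copied from the cor(num_data) output above
corr <- matrix(
  c( 1.00000000, -0.04434018,  0.01818929, -0.06961729, -0.10429751,
    -0.04434018,  1.00000000,  0.14307972,  0.17651002,  0.09926623,
     0.01818929,  0.14307972,  1.00000000, -0.17207799, -0.03546923,
    -0.06961729,  0.17651002, -0.17207799,  1.00000000, -0.30415811,
    -0.10429751,  0.09926623, -0.03546923, -0.30415811,  1.00000000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Ignore the diagonal (each variable correlates perfectly with itself)
off_diag <- corr
diag(off_diag) <- 0

# Position of the largest absolute correlation between two different variables
pos <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]
strongest_pair <- vars[pos]
strongest_value <- off_diag[pos[1], pos[2]]
cat("Strongest pair:", strongest_pair, "r =", strongest_value, "\n")

# Variable most correlated (in absolute terms) with Weekly_Sales
sales_best <- names(which.max(abs(off_diag["Weekly_Sales", ])))
cat("Most correlated with Weekly_Sales:", sales_best, "\n")
```

With the real data, corr could simply be replaced by cor(num_data).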
# corrplot() comes from the corrplot package (assumed to be installed)
library(corrplot)
corrplot(cor(num_data),
         method = "color",
         col = colorRampPalette(c("blue", "white", "red"))(200),
         addCoef.col = "black",
         number.cex = 0.8,
         tl.col = "black",
         tl.srt = 45,
         title = "Correlation Heatmap of Walmart Data",
         mar = c(0,0,2,0))
Interpretation: Red means two variables move together, blue means they move in opposite directions, and white means no relationship. Weekly Sales row is mostly white — confirming that economic factors alone cannot predict sales well.
num_data <- data_clean[, c("Weekly_Sales", "Temperature", "Fuel_Price", "CPI", "Unemployment")]
num_data <- na.omit(num_data)
corr_matrix <- cor(num_data)
corrplot(corr_matrix, method = "color")
Interpretation: The heatmap shows Weekly Sales has weak correlation with major economic variables, meaning these factors alone do not strongly predict sales. Store-specific or operational factors likely have a greater impact on Walmart’s sales performance.
model1 <- lm(Weekly_Sales ~ Temperature, data = data_clean)
summary(model1)
##
## Call:
## lm(formula = Weekly_Sales ~ Temperature, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -838798 -482502 -82523 384386 1633389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1115899.2 23476.3 47.53 < 0.0000000000000002 ***
## Temperature -1312.6 369.7 -3.55 0.000387 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 544700 on 6399 degrees of freedom
## Multiple R-squared: 0.001966, Adjusted R-squared: 0.00181
## F-statistic: 12.61 on 1 and 6399 DF, p-value: 0.0003874
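Interpretation: Temperature has a statistically significant effect on sales (p < 0.001), but it is extremely small: each extra degree is associated with about $1,300 lower weekly sales, and the model explains only about 0.2% of the variation. Knowing the temperature alone tells us almost nothing useful about how much a store will sell that week.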
model2 <- lm(Weekly_Sales ~ Temperature + Fuel_Price + CPI + Unemployment,
data = data_clean)
summary(model2)
##
## Call:
## lm(formula = Weekly_Sales ~ Temperature + Fuel_Price + CPI +
## Unemployment, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -922461 -473019 -105689 395279 1692165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1651066.0 77233.1 21.378 < 0.0000000000000002 ***
## Temperature -318.3 384.5 -0.828 0.408
## Fuel_Price -4817.1 15251.5 -0.316 0.752
## CPI -1524.2 189.4 -8.048 0.000000000000000992 ***
## Unemployment -39711.8 3848.3 -10.319 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 539200 on 6396 degrees of freedom
## Multiple R-squared: 0.02234, Adjusted R-squared: 0.02172
## F-statistic: 36.53 on 4 and 6396 DF, p-value: < 0.00000000000000022
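Interpretation: Adding more variables improved the model slightly (R-squared rises from about 0.002 to about 0.022), but it still explains very little of what drives weekly sales. Only CPI and Unemployment are statistically significant; Temperature and Fuel_Price are not. This suggests that factors missing from the model, such as store-level differences, matter far more than these economic conditions.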
data_clean$Predicted <- predict(model2)
ggplot(data_clean, aes(x = Predicted, y = Weekly_Sales)) +
geom_point(color = "yellow", alpha = 0.3) +
geom_abline(slope = 1, intercept = 0, color = "red") +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(title = "Actual vs Predicted Weekly Sales",
x = "Predicted Sales",
y = "Actual Sales")
Interpretation: The red line shows where perfect predictions would fall — since most yellow dots are far from it, our model is not very accurate at predicting exact weekly sales for individual stores.
new_data <- data.frame(
Temperature = 70,
Fuel_Price = 3,
CPI = 220,
Unemployment = 7
)
predict(model2, newdata = new_data)
## 1
## 1001020
Interpretation: We gave the model some economic conditions and it predicted about $1 million in weekly sales — which is close to the average, showing the model tends to predict near the mean rather than capturing store specific performance.
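The prediction can be reproduced by hand from the coefficient table of model2. The sketch below copies the rounded estimates from the summary output, so the result differs from predict() by a few dollars; coefs, x_new, and manual_pred are illustrative names:

```r
# Rounded coefficient estimates copied from summary(model2)
coefs <- c("(Intercept)" = 1651066.0,
           Temperature   = -318.3,
           Fuel_Price    = -4817.1,
           CPI           = -1524.2,
           Unemployment  = -39711.8)

# Same inputs as new_data, with 1 for the intercept term
x_new <- c("(Intercept)" = 1,
           Temperature   = 70,
           Fuel_Price    = 3,
           CPI           = 220,
           Unemployment  = 7)

# Linear prediction: intercept plus the sum of coefficient * value
manual_pred <- sum(coefs * x_new[names(coefs)])
cat("Manual prediction:", manual_pred, "\n")
```

Most of the gap between the intercept and the prediction comes from the CPI and Unemployment terms, which matches their being the only significant predictors.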
model3 <- lm(Weekly_Sales ~ Temperature + Fuel_Price + CPI + Unemployment + factor(Store),
data = data_clean)
summary(model3)$r.squared
## [1] 0.9373765
summary(model3)$adj.r.squared
## [1] 0.9369033
Interpretation: Including Store as a predictor significantly improves the regression model, showing that store-specific factors strongly influence Weekly Sales. The higher R-squared values indicate much better predictive power compared to using economic variables alone.
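Adjusted R-squared penalises the extra predictors, which matters here because factor(Store) adds many dummy variables. The sketch below recomputes it from the reported R-squared values, assuming n = 6401 rows and that all 44 store dummies (45 stores minus the baseline) were estimated; adj_r2 is a hypothetical helper:

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 6401  # rows in data_clean

# model2: four economic predictors
adj_model2 <- adj_r2(0.02234, n, p = 4)

# model3: four economic predictors plus 44 store dummies (assumption)
adj_model3 <- adj_r2(0.9373765, n, p = 4 + 44)

cat("Adjusted R-squared, model2:", round(adj_model2, 5), "\n")
cat("Adjusted R-squared, model3:", round(adj_model3, 7), "\n")
```

Both values agree with the summary output above, so the penalty for the extra store terms is tiny compared with the gain in explained variation.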
data_clean$Predicted_New <- predict(model3)
ggplot(data_clean, aes(x = Predicted_New, y = Weekly_Sales)) +
geom_point(color = "green", alpha = 0.3) +
geom_abline(slope = 1, intercept = 0, color = "red") +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(title = "Actual vs Predicted Weekly Sales (Improved Model)",
x = "Predicted Sales",
y = "Actual Sales")
Interpretation: The old model did not know which store it was predicting for. Once we told it the store identity, R-squared rose from about 0.02 to about 0.94. The dataset does not record location or size directly, but this shows that store-specific differences matter far more than economic conditions in predicting weekly sales.
cat("Weak Model R² :", round(summary(model2)$r.squared, 4), "\n")
## Weak Model R² : 0.0223
cat("Improved Model R²:", round(summary(model3)$r.squared, 4), "\n")
## Improved Model R²: 0.9374
Interpretation: The improved model explains about 93.7% of the variation in weekly sales compared to just 2.2% before, a clear and substantial improvement from adding store information alone.
The Walmart Sales Analysis project showed that sales performance is influenced by multiple factors including store performance, seasonal trends, holidays, and economic conditions such as CPI and unemployment. Data cleaning and visualization improved analytical accuracy, while regression models provided useful predictive insights despite limited standalone predictive strength. Overall, the project demonstrated how R can effectively convert raw sales data into meaningful business intelligence.