Advanced Walmart Data Analysis

1 Introduction

This project applies basic R programming concepts to Walmart sales data for statistical, business, and predictive analysis. It covers data cleaning, outlier removal, descriptive statistics, visualization, correlation, and regression to understand sales patterns and business performance. The analysis turns raw retail data into meaningful insights using practical R functions.

2 Basic Dataset Structure

getwd()

## [1] "/Users/trishita/Downloads"

setwd("~/Downloads")
data <- read.csv("Walmart.csv", stringsAsFactors = FALSE)
nrow(data)

## [1] 6435

ncol(data)

## [1] 8

2.1 Q: Are the variables in the dataset correctly structured for analysis?

str(data)

## 'data.frame':    6435 obs. of  8 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr  "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
##  $ Weekly_Sales: num  1643691 1641957 1611968 1409728 1554807 ...
##  $ Holiday_Flag: int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Temperature : num  42.3 38.5 39.9 46.6 46.5 ...
##  $ Fuel_Price  : num  2.57 2.55 2.51 2.56 2.62 ...
##  $ CPI         : num  211 211 211 211 211 ...
##  $ Unemployment: num  8.11 8.11 8.11 8.11 8.11 ...

2.2 Q: Since the Date variable is stored as character, can it be converted to a proper Date format for analysis?

data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
str(data$Date)

##  Date[1:6435], format: "2010-02-05" "2010-02-12" "2010-02-19" "2010-02-26" "2010-03-05" ...

summary(data$Date)

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2010-02-05" "2010-10-08" "2011-06-17" "2011-06-17" "2012-02-24" "2012-10-26"

Interpretation: The Date column is successfully converted from character to Date type. The data spans from 5 February 2010 to 26 October 2012, roughly 2.7 years of weekly sales records.
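A quick sanity check after conversion is to confirm that no value failed to parse, since as.Date() returns NA when a string does not match the given format. The sketch below uses a small made-up vector (raw_dates is hypothetical and only mimics the layout of the Date column) to illustrate the idea and to compute the span in weeks.

```r
# Hypothetical sample strings in the same day-month-year layout as the Date column
raw_dates <- c("05-02-2010", "12-02-2010", "26-10-2012")

parsed <- as.Date(raw_dates, format = "%d-%m-%Y")

# Any NA here would mean a string did not match the expected format
n_failed <- sum(is.na(parsed))

# Span between the earliest and latest parsed date, in weeks
span_weeks <- as.numeric(difftime(max(parsed), min(parsed), units = "weeks"))

n_failed
span_weeks
```

On the real data, the same two lines could be run on data$Date right after the conversion.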

3 Pre-Processing

3.1 Q: Are there any missing values present in the dataset?

colSums(is.na(data))

##        Store         Date Weekly_Sales Holiday_Flag  Temperature   Fuel_Price 
##            0            0            0            0            0            0 
##          CPI Unemployment 
##            0            0

Interpretation: There are no missing values in any column of the dataset, meaning the data is complete and no imputation or row removal is required before analysis.

3.2 Q: Are there any duplicate records present in the dataset?

n_before <- nrow(data)
data <- unique(data)
n_after <- nrow(data)

cat("Before:", n_before, "\n")

## Before: 6435

cat("After :", n_after, "\n")

## After : 6435

cat("Removed:", n_before - n_after, "duplicate(s)\n")

## Removed: 0 duplicate(s)

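Interpretation: No fully identical rows were found in the dataset, so no row needs to be removed. Because unique() compares entire rows, this result is consistent with each record being a distinct store-week combination, although it does not check the Store and Date columns on their own.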
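If the goal is specifically to confirm one record per store-week, a check on the key columns may be more direct than comparing whole rows. The following is a minimal sketch on a hypothetical toy data frame (toy is made up for illustration), using duplicated() on Store and Date only.

```r
# Hypothetical toy data: rows 2 and 4 share the same Store and Date
toy <- data.frame(
  Store        = c(1, 1, 2, 1),
  Date         = as.Date(c("2010-02-05", "2010-02-12", "2010-02-05", "2010-02-12")),
  Weekly_Sales = c(100, 200, 300, 250)
)

# unique() compares whole rows, so all four rows survive
n_unique_rows <- nrow(unique(toy))

# duplicated() on the key columns flags the repeated store-week
key_dupes <- duplicated(toy[, c("Store", "Date")])
n_key_dupes <- sum(key_dupes)

n_unique_rows
n_key_dupes
```

The same pattern, applied to data[, c("Store", "Date")], would report how many store-week keys repeat in the real dataset.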

3.3 Q: How can the variables be converted to correct data types and new features be created?

data$Holiday_Flag <- as.factor(data$Holiday_Flag)
data$Month <- as.numeric(format(data$Date, "%m"))
data$Year <- as.numeric(format(data$Date, "%Y"))
data$Sales_Category <- ifelse(data$Weekly_Sales > mean(data$Weekly_Sales), "High", "Low")
data$Bonus_Sales <- data$Weekly_Sales * 0.10
str(data)

## 'data.frame':    6435 obs. of  12 variables:
##  $ Store         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date          : Date, format: "2010-02-05" "2010-02-12" ...
##  $ Weekly_Sales  : num  1643691 1641957 1611968 1409728 1554807 ...
##  $ Holiday_Flag  : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Temperature   : num  42.3 38.5 39.9 46.6 46.5 ...
##  $ Fuel_Price    : num  2.57 2.55 2.51 2.56 2.62 ...
##  $ CPI           : num  211 211 211 211 211 ...
##  $ Unemployment  : num  8.11 8.11 8.11 8.11 8.11 ...
##  $ Month         : num  2 2 2 2 3 3 3 3 4 4 ...
##  $ Year          : num  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Sales_Category: chr  "High" "High" "High" "High" ...
##  $ Bonus_Sales   : num  164369 164196 161197 140973 155481 ...

Interpretation: We fixed the data types and added 4 new columns that will help us analyse sales patterns by time, performance category, and bonus calculations.

3.4 Q: Show the summary of the dataset after cleaning.

summary(data)

##      Store         Date             Weekly_Sales     Holiday_Flag
##  Min.   : 1   Min.   :2010-02-05   Min.   : 209986   0:5985      
##  1st Qu.:12   1st Qu.:2010-10-08   1st Qu.: 553350   1: 450      
##  Median :23   Median :2011-06-17   Median : 960746               
##  Mean   :23   Mean   :2011-06-17   Mean   :1046965               
##  3rd Qu.:34   3rd Qu.:2012-02-24   3rd Qu.:1420159               
##  Max.   :45   Max.   :2012-10-26   Max.   :3818686               
##   Temperature       Fuel_Price         CPI         Unemployment   
##  Min.   : -2.06   Min.   :2.472   Min.   :126.1   Min.   : 3.879  
##  1st Qu.: 47.46   1st Qu.:2.933   1st Qu.:131.7   1st Qu.: 6.891  
##  Median : 62.67   Median :3.445   Median :182.6   Median : 7.874  
##  Mean   : 60.66   Mean   :3.359   Mean   :171.6   Mean   : 7.999  
##  3rd Qu.: 74.94   3rd Qu.:3.735   3rd Qu.:212.7   3rd Qu.: 8.622  
##  Max.   :100.14   Max.   :4.468   Max.   :227.2   Max.   :14.313  
##      Month             Year        Sales_Category  Bonus_Sales    
##  Min.   : 1.000   Min.   :2010   Length   :6435   Min.   : 20999  
##  1st Qu.: 4.000   1st Qu.:2010   N.unique :   2   1st Qu.: 55335  
##  Median : 6.000   Median :2011   N.blank  :   0   Median : 96075  
##  Mean   : 6.448   Mean   :2011   Min.nchar:   3   Mean   :104696  
##  3rd Qu.: 9.000   3rd Qu.:2012   Max.nchar:   4   3rd Qu.:142016  
##  Max.   :12.000   Max.   :2012                    Max.   :381869

Interpretation: The summary gives a snapshot of every column. The maximum of Weekly_Sales (about $3.8 million) is far above the third quartile (about $1.4 million), so the next step is to check for outliers.

Outlier Detection

Q: Are there any extreme values present in Weekly Sales?

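library(ggplot2)  # plotting functions used throughout the report
library(scales)   # provides comma() for axis labels

ggplot(data, aes(y = Weekly_Sales)) +
  geom_boxplot(fill = "orange", color = "black") +
  scale_y_continuous(labels = comma) +
  labs(title = "Boxplot of Weekly Sales (Before Outlier Removal)",
       y = "Weekly Sales")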

Interpretation: In this boxplot, the dots above the upper whisker are outliers: unusually high sales weeks that do not represent typical store performance. They are removed in the next step.

3.5 Q: How can outliers be detected and removed from Weekly Sales using the IQR method?

Q1 <- quantile(data$Weekly_Sales, 0.25)
Q3 <- quantile(data$Weekly_Sales, 0.75)
IQRV <- IQR(data$Weekly_Sales)
lower <- Q1 - 1.5 * IQRV
upper <- Q3 + 1.5 * IQRV
Q1

##      25% 
## 553350.1

Q3

##     75% 
## 1420159

IQRV

## [1] 866808.6

lower

##       25% 
## -746862.7

upper

##     75% 
## 2720371

data_clean <- data[data$Weekly_Sales >= lower & data$Weekly_Sales <= upper, ]
nrow(data)

## [1] 6435

nrow(data_clean)

## [1] 6401

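Interpretation: We calculated boundaries using Q1, Q3 and the IQR. Any sales value outside those boundaries is treated as abnormal and excluded from data_clean. Since the lower bound is negative, only the upper bound (about $2.72 million) removes anything in practice. A total of 34 outlier rows were removed (6,435 - 6,401), leaving 6,401 clean records.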
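The same IQR rule can be wrapped in a small helper so that it is easy to reuse on other columns. This is only a sketch: the function name iqr_bounds, its k argument, and the sales vector are hypothetical and not part of the original analysis.

```r
# Return the lower and upper IQR fences for a numeric vector
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Hypothetical values: nine ordinary points and one extreme point
sales <- c(10, 12, 13, 14, 15, 16, 17, 18, 20, 100)

bounds <- iqr_bounds(sales)
keep <- sales >= bounds["lower"] & sales <= bounds["upper"]

bounds
sum(!keep)  # number of values flagged as outliers
```

Keeping k as an argument makes it easy to try a stricter or looser fence than the conventional 1.5.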

3.6 Q: How does the distribution of Weekly Sales look after removing outliers?

ggplot(data_clean, aes(y = Weekly_Sales)) +
  geom_boxplot(fill = "yellow", color = "black") +
  scale_y_continuous(labels = comma) +
  labs(title = "Boxplot of Weekly Sales (After Outlier Removal)",
       y = "Weekly Sales")

Interpretation: Compared to the previous boxplot, the dots above are gone — the data now represents typical weekly sales performance without extreme spikes pulling our results.

Descriptive Statistics

Q: What are the central tendency and spread measures of Weekly Sales?

mean(data_clean$Weekly_Sales)

## [1] 1036130

median(data_clean$Weekly_Sales)

## [1] 957298.3

sd(data_clean$Weekly_Sales)

## [1] 545196.1

var(data_clean$Weekly_Sales)

## [1] 297238739401

Interpretation: The average store-week brings in about $1.04 million, with a median of about $957K. The standard deviation of roughly $545K, more than half the mean, shows that sales vary widely across stores and weeks.

3.7 Q: What are the quantile values and interquartile range of Weekly Sales?

quantile(data_clean$Weekly_Sales)

##        0%       25%       50%       75%      100% 
##  209986.2  551743.1  957298.3 1414564.5 2685351.8

IQR(data_clean$Weekly_Sales)

## [1] 862821.5

Interpretation: Quartiles divide the data into four equal parts. The middle 50% of store-weeks fall between about $552K and $1.41M in weekly sales, which can be treated as the typical range of performance.

3.8 Q: What are the minimum and maximum values of Weekly Sales?

min(data_clean$Weekly_Sales)

## [1] 209986.2

max(data_clean$Weekly_Sales)

## [1] 2685352

Interpretation: Even the weakest store-week in the cleaned data brought in about $210K, while the strongest reached nearly $2.7 million, almost 13 times more. Because the largest values were already removed as outliers, the gap in the raw data is wider still.

3.9 Q: How are Weekly Sales distributed between High and Low performance categories?

table(data_clean$Sales_Category)

## 
## High  Low 
## 2842 3559

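Interpretation: About 56% of store-weeks (3,559 of 6,401) fall below the average, which is what a right-skewed distribution produces when the mean sits above the median. Note that the High/Low cutoff was computed from the mean of the full dataset, before outlier removal.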

3.10 Q: How does the average Weekly Sales vary across different stores?

store_avg <- aggregate(Weekly_Sales ~ Store, data = data_clean, mean)
head(store_avg, 10)

##    Store Weekly_Sales
## 1      1    1555264.4
## 2      2    1905830.2
## 3      3     402704.4
## 4      4    2051352.0
## 5      5     318011.8
## 6      6    1556539.1
## 7      7     570617.3
## 8      8     908749.5
## 9      9     543980.6
## 10    10    1852745.5

Interpretation: Not all Walmart stores perform the same — some stores consistently sell much more than others every single week, which could be due to location, store size, or local population density.

3.11 Q: Which stores have the highest average Weekly Sales?

avg_sales <- aggregate(Weekly_Sales ~ Store, data = data_clean, mean)
top_stores <- avg_sales[order(-avg_sales$Weekly_Sales), ]
head(top_stores)

##    Store Weekly_Sales
## 20    20      2058998
## 4      4      2051352
## 14    14      1986529
## 13    13      1957682
## 2      2      1905830
## 10    10      1852745

Interpretation: Store 20 has the highest average weekly sales in the cleaned data at about $2.06 million, with Store 4 a close second at about $2.05 million. That is more than six times the average of a low-volume store such as Store 5 (about $318K).

3.12 Q: How does the presence of holidays affect Weekly Sales?

aggregate(Weekly_Sales ~ Holiday_Flag, data = data_clean, mean)

##   Holiday_Flag Weekly_Sales
## 1            0      1032370
## 2            1      1086950

Interpretation: Stores sell somewhat more during holiday weeks: about $55,000 more on average, or roughly 5%. The difference exists but is not as dramatic as one might expect, suggesting that other factors, such as differences between stores, matter more than holidays alone.
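The size of the holiday gap can be worked out directly from the two group means printed by aggregate(). The variable names below are introduced only for this illustration.

```r
# Group means copied from the aggregate() output above
non_holiday_mean <- 1032370
holiday_mean     <- 1086950

abs_gap <- holiday_mean - non_holiday_mean   # difference in dollars
pct_gap <- 100 * abs_gap / non_holiday_mean  # difference relative to non-holiday weeks

abs_gap
round(pct_gap, 1)
```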

3.13 Q: What are the maximum and minimum Weekly Sales recorded for each store?

aggregate(Weekly_Sales ~ Store, data = data_clean, max)

##    Store Weekly_Sales
## 1      1    2387950.2
## 2      2    2658725.3
## 3      3     605990.4
## 4      4    2508955.2
## 5      5     507900.1
## 6      6    2644633.0
## 7      7    1059715.3
## 8      8    1511641.1
## 9      9     905324.7
## 10    10    2555031.2
## 11    11    2306265.4
## 12    12    1768249.9
## 13    13    2462779.1
## 14    14    2685351.8
## 15    15    1368318.2
## 16    16    1004730.7
## 17    17    1309226.8
## 18    18    2027507.1
## 19    19    2678206.4
## 20    20    2565259.9
## 21    21    1587257.8
## 22    22    1962445.0
## 23    23    2587953.3
## 24    24    2386015.8
## 25    25    1295391.2
## 26    26    1573982.5
## 27    27    2627910.8
## 28    28    2026026.4
## 29    29    1130926.8
## 30    30     519354.9
## 31    31    2068943.0
## 32    32    1959527.0
## 33    33     331173.5
## 34    34    1620748.2
## 35    35    1781867.0
## 36    36     489372.0
## 37    37     605791.5
## 38    38     499267.7
## 39    39    2554482.8
## 40    40    1648829.2
## 41    41    2263722.7
## 42    42     674919.4
## 43    43     725043.0
## 44    44     376233.9
## 45    45    1682862.0

aggregate(Weekly_Sales ~ Store, data = data_clean, min)

##    Store Weekly_Sales
## 1      1    1316899.3
## 2      2    1650394.4
## 3      3     339597.4
## 4      4    1762539.3
## 5      5     260636.7
## 6      6    1261253.2
## 7      7     372673.6
## 8      8     772539.1
## 9      9     452905.2
## 10    10    1627707.3
## 11    11    1100418.7
## 12    12     802105.5
## 13    13    1633663.1
## 14    14    1479514.7
## 15    15     454183.4
## 16    16     368600.0
## 17    17     635862.6
## 18    18     540922.9
## 19    19    1181204.5
## 20    20    1761016.5
## 21    21     596218.2
## 22    22     774262.3
## 23    23    1016756.1
## 24    24    1057290.4
## 25    25     558794.6
## 26    26     809833.2
## 27    27    1263534.9
## 28    28    1079669.1
## 29    29     395987.2
## 30    30     369722.3
## 31    31    1198071.6
## 32    32     955463.8
## 33    33     209986.2
## 34    34     836717.8
## 35    35     576332.1
## 36    36     270678.0
## 37    37     451327.6
## 38    38     303908.8
## 39    39    1158698.4
## 40    40     764014.8
## 41    41     991941.7
## 42    42     428953.6
## 43    43     505405.8
## 44    44     241937.1
## 45    45     617207.6

Interpretation: Some stores have very consistent sales week to week, while others swing wildly between high and low — this helps identify which stores are stable performers and which ones are unpredictable.
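Reading two long tables side by side makes it hard to see which stores swing the most. One possible approach, sketched here on a hypothetical toy data frame, is to merge the per-store minimum and maximum and sort by the gap between them. The column name Spread is an assumption introduced for this example.

```r
# Hypothetical toy data: store 1 is steady, store 2 swings widely
toy <- data.frame(
  Store        = c(1, 1, 1, 2, 2, 2),
  Weekly_Sales = c(100, 110, 105, 50, 200, 120)
)

# One row per store with its minimum, maximum, and the gap between them
store_min <- aggregate(Weekly_Sales ~ Store, data = toy, min)
store_max <- aggregate(Weekly_Sales ~ Store, data = toy, max)
store_range <- merge(store_min, store_max, by = "Store",
                     suffixes = c("_min", "_max"))
store_range$Spread <- store_range$Weekly_Sales_max - store_range$Weekly_Sales_min

# Largest spread first
store_range[order(-store_range$Spread), ]
```

Replacing toy with data_clean would give the same ranking for the 45 real stores.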

3.14 Q: Which records show the highest Weekly Sales values in the dataset?

top_sales <- data_clean[order(-data_clean$Weekly_Sales), ]
head(top_sales, 10)

##      Store       Date Weekly_Sales Holiday_Flag Temperature Fuel_Price      CPI
## 1954    14 2011-11-25      2685352            1       48.71      3.492 188.3504
## 2621    19 2010-12-24      2678206            0       26.05      3.309 132.7477
## 186      2 2010-11-26      2658725            1       62.98      2.735 211.4063
## 814      6 2011-12-23      2644633            0       49.45      3.112 220.9477
## 3761    27 2010-11-26      2627911            1       46.67      3.186 136.6896
## 1860    14 2010-02-05      2623470            0       27.31      2.784 181.8712
## 238      2 2011-11-25      2614202            1       56.36      3.236 218.1130
## 189      2 2010-12-17      2609167            0       47.55      2.869 211.0645
## 1904    14 2010-12-10      2600519            0       30.54      3.109 182.5520
## 1957    14 2011-12-16      2594363            0       39.93      3.413 188.7979
##      Unemployment Month Year Sales_Category Bonus_Sales
## 1954        8.523    11 2011           High    268535.2
## 2621        8.067    12 2010           High    267820.6
## 186         8.163    11 2010           High    265872.5
## 814         6.551    12 2011           High    264463.3
## 3761        8.021    11 2010           High    262791.1
## 1860        8.992     2 2010           High    262347.0
## 238         7.441    11 2011           High    261420.2
## 189         8.163    12 2010           High    260916.7
## 1904        8.724    12 2010           High    260051.9
## 1957        8.523    12 2011           High    259436.3

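Interpretation: Nine of the ten largest weeks in the cleaned data fall in late November (the holiday-flagged weeks) or in December before Christmas. The one exception is Store 14 in early February 2010. Stores 14 and 2 account for seven of the ten rows. Stores 20 and 4, which have the highest averages, do not appear here, possibly because their biggest weeks exceeded the upper bound and were removed as outliers.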

4 Visualizations

4.1 Q: What is the frequency distribution of Weekly Sales in the clean dataset?

ggplot(data_clean, aes(x = Weekly_Sales)) +
  geom_histogram(fill = "yellow", color = "black", bins = 30) +
  scale_x_continuous(labels = comma) +
  labs(title = "Histogram of Weekly Sales (Clean Data)",
       x = "Weekly Sales",
       y = "Frequency")

Interpretation: Most store-weeks fall between about $500K and $1.5M. Far fewer go beyond that range, which is why the bars get shorter toward the right of the chart.

4.2 Q: What is the smooth probability distribution of Weekly Sales?

ggplot(data_clean, aes(x = Weekly_Sales)) +
  geom_density(fill = "pink", alpha = 0.5) +
  scale_x_continuous(labels = comma) +
  labs(title = "Density Plot of Weekly Sales",
       x = "Weekly Sales",
       y = "Density")

Interpretation: Unlike a histogram, the density plot gives a smooth curve — the highest point of the curve shows where most of the weekly sales values are concentrated, which is around $800K to $1M.

4.3 Q: What pattern is observed between Temperature and Weekly Sales?

ggplot(data_clean, aes(x = Temperature, y = Weekly_Sales)) +
  geom_point(color = "blue", alpha = 0.3) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  scale_y_continuous(labels = comma) +
  labs(title = "Temperature vs Weekly Sales",
       x = "Temperature",
       y = "Weekly Sales")

## `geom_smooth()` using formula = 'y ~ x'

Interpretation: The red line is almost flat — meaning temperature has very little effect on weekly sales. Stores sell roughly the same amount regardless of how hot or cold it is outside.

4.4 Q: How has Weekly Sales trended over time across the dataset?

ggplot(data_clean, aes(x = Date, y = Weekly_Sales)) +
  geom_line(color = "red") +
  scale_y_continuous(labels = comma) +
  labs(title = "Weekly Sales Trend Over Time",
       x = "Date",
       y = "Weekly Sales")

Interpretation: Sales rise and fall throughout the period, and the largest spikes coincide with the holiday weeks in late November and December, when stores sell significantly more than usual.
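Because data_clean holds one row per store per week, plotting it directly with geom_line() connects many stores' values at each date, which can hide the overall trend. One possible alternative, sketched here on a hypothetical toy data frame, is to collapse the data to a single total per date first. The plotting step is guarded so the sketch still runs where ggplot2 is not installed.

```r
# Hypothetical toy data: two stores observed on the same three dates
toy <- data.frame(
  Store        = rep(c(1, 2), each = 3),
  Date         = rep(as.Date(c("2010-02-05", "2010-02-12", "2010-02-19")), times = 2),
  Weekly_Sales = c(100, 120, 110, 300, 330, 310)
)

# Collapse to one value per date so the line has a single point per week
weekly_total <- aggregate(Weekly_Sales ~ Date, data = toy, sum)
weekly_total

# Build the plot only if ggplot2 is available; display it with print(p)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(weekly_total, ggplot2::aes(x = Date, y = Weekly_Sales)) +
    ggplot2::geom_line(color = "red") +
    ggplot2::labs(title = "Total Weekly Sales Over Time",
                  x = "Date", y = "Total Weekly Sales")
}
```

Using mean instead of sum in aggregate() would show the average store rather than the chain-wide total.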

4.5 Q: How can the difference in Weekly Sales between holiday and non-holiday periods be visualized?

holiday_sales <- aggregate(Weekly_Sales ~ Holiday_Flag, data = data_clean, mean)

ggplot(holiday_sales, aes(x = factor(Holiday_Flag), y = Weekly_Sales, fill = factor(Holiday_Flag))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("steelblue", "tomato")) +
  scale_y_continuous(labels = comma) +
  labs(title = "Average Sales: Holiday vs Non-Holiday",
       x = "Holiday (0 = No, 1 = Yes)",
       y = "Average Weekly Sales")

Interpretation: The two bars are almost the same height — holidays do push sales up a little but not by a huge amount. Other factors like store size and location have a bigger impact on sales than holidays alone.

4.6 Q: How does Weekly Sales vary across different stores using visualization?

ggplot(data_clean, aes(x = factor(Store), y = Weekly_Sales)) +
  geom_boxplot(fill = "lightgreen") +
  scale_y_continuous(labels = comma) +
  labs(title = "Weekly Sales Across Stores",
       x = "Store",
       y = "Weekly Sales") +
  theme(axis.text.x = element_text(angle = 90))

Interpretation: This plot shows all 45 stores side by side — taller boxes mean more variation in sales, higher boxes mean better overall performance. It is easy to spot which stores stand out just by looking at the chart.

4.7 Q: How does the Holiday effect on Weekly Sales differ across top 10 stores?

top10_stores <- head(avg_sales[order(-avg_sales$Weekly_Sales), "Store"], 10)

store_holiday <- aggregate(Weekly_Sales ~ Store + Holiday_Flag, 
                           data = data_clean[data_clean$Store %in% top10_stores, ], 
                           mean)

ggplot(store_holiday, aes(x = factor(Store), y = Weekly_Sales, fill = factor(Holiday_Flag))) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("orange", "skyblue"),
                    labels = c("Non-Holiday", "Holiday")) +
  scale_y_continuous(labels = comma) +
  labs(title = "Average Weekly Sales by Store and Holiday Flag (Top 10 Stores)",
       x = "Store",
       y = "Average Weekly Sales",
       fill = "Holiday") +
  theme(axis.text.x = element_text(angle = 45))

Interpretation: By placing two bars side by side for each store, we can directly compare how much extra sales each store generates during holiday weeks compared to normal weeks.

5 Correlation Analysis

5.1 Q: What are the pairwise correlations between Weekly Sales and other numeric variables?

num_data <- data_clean[, c("Weekly_Sales", "Temperature", "Fuel_Price", "CPI", "Unemployment")]
cor(num_data)

##              Weekly_Sales Temperature  Fuel_Price         CPI Unemployment
## Weekly_Sales   1.00000000 -0.04434018  0.01818929 -0.06961729  -0.10429751
## Temperature   -0.04434018  1.00000000  0.14307972  0.17651002   0.09926623
## Fuel_Price     0.01818929  0.14307972  1.00000000 -0.17207799  -0.03546923
## CPI           -0.06961729  0.17651002 -0.17207799  1.00000000  -0.30415811
## Unemployment  -0.10429751  0.09926623 -0.03546923 -0.30415811   1.00000000

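Interpretation: None of the economic variables is strongly correlated with weekly sales on its own. The largest is Unemployment at about -0.10. Among the predictors, the strongest relationship is a moderate negative correlation between CPI and Unemployment (about -0.30). CPI and Fuel_Price are only weakly and negatively correlated (about -0.17), so they do not move closely together in this data.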
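With more variables, picking the strongest pair by eye becomes error-prone. The sketch below shows one way to find it programmatically. To stay self-contained, it rebuilds the matrix from the values printed above, rounded to three decimals. On the real data, cor(num_data) would be used instead.

```r
# Correlation values copied from the cor() output above, rounded to 3 decimals
vars <- c("Weekly_Sales", "Temperature", "Fuel_Price", "CPI", "Unemployment")
cm <- matrix(c(
   1.000, -0.044,  0.018, -0.070, -0.104,
  -0.044,  1.000,  0.143,  0.177,  0.099,
   0.018,  0.143,  1.000, -0.172, -0.035,
  -0.070,  0.177, -0.172,  1.000, -0.304,
  -0.104,  0.099, -0.035, -0.304,  1.000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Keep only the entries above the diagonal so each pair is counted once
cm[!upper.tri(cm)] <- NA
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)

strongest_pair  <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
strongest_value <- cm[idx[1, "row"], idx[1, "col"]]

strongest_pair
strongest_value
```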

5.2 Q: How can the correlation matrix be visualized as a color coded heatmap?

library(corrplot)  # provides corrplot()

corrplot(cor(num_data), 
         method = "color",
         col = colorRampPalette(c("blue", "white", "red"))(200),
         addCoef.col = "black",
         number.cex = 0.8,
         tl.col = "black",
         tl.srt = 45,
         title = "Correlation Heatmap of Walmart Data",
         mar = c(0,0,2,0))

Interpretation: Red means two variables move together, blue means they move in opposite directions, and white means no relationship. Weekly Sales row is mostly white — confirming that economic factors alone cannot predict sales well.

5.3 Q: How can a correlation heatmap visually represent relationships among major numeric variables?

num_data <- data_clean[, c("Weekly_Sales", "Temperature", "Fuel_Price", "CPI", "Unemployment")]
num_data <- na.omit(num_data)
corr_matrix <- cor(num_data)
corrplot(corr_matrix, method = "color")

Interpretation: The heatmap shows Weekly Sales has weak correlation with major economic variables, meaning these factors alone do not strongly predict sales. Store-specific or operational factors likely have a greater impact on Walmart’s sales performance.

6 Regression Analysis

6.1 Q: How does Temperature alone predict Weekly Sales?

model1 <- lm(Weekly_Sales ~ Temperature, data = data_clean)
summary(model1)

## 
## Call:
## lm(formula = Weekly_Sales ~ Temperature, data = data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -838798 -482502  -82523  384386 1633389 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 1115899.2    23476.3   47.53 < 0.0000000000000002 ***
## Temperature   -1312.6      369.7   -3.55             0.000387 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 544700 on 6399 degrees of freedom
## Multiple R-squared:  0.001966,   Adjusted R-squared:  0.00181 
## F-statistic: 12.61 on 1 and 6399 DF,  p-value: 0.0003874

Interpretation: Temperature has a statistically significant but extremely small effect on sales. With an R-squared of about 0.002, knowing the temperature alone tells us almost nothing useful about how much a store will sell that week.

6.2 Q: How do Temperature, Fuel Price, CPI and Unemployment jointly predict Weekly Sales?

model2 <- lm(Weekly_Sales ~ Temperature + Fuel_Price + CPI + Unemployment, 
             data = data_clean)
summary(model2)

## 
## Call:
## lm(formula = Weekly_Sales ~ Temperature + Fuel_Price + CPI + 
##     Unemployment, data = data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -922461 -473019 -105689  395279 1692165 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  1651066.0    77233.1  21.378 < 0.0000000000000002 ***
## Temperature     -318.3      384.5  -0.828                0.408    
## Fuel_Price     -4817.1    15251.5  -0.316                0.752    
## CPI            -1524.2      189.4  -8.048 0.000000000000000992 ***
## Unemployment  -39711.8     3848.3 -10.319 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 539200 on 6396 degrees of freedom
## Multiple R-squared:  0.02234,    Adjusted R-squared:  0.02172 
## F-statistic: 36.53 on 4 and 6396 DF,  p-value: < 0.00000000000000022

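Interpretation: Adding more variables raises R-squared from about 0.2% to about 2.2%, so the model still explains very little of what drives weekly sales. Only CPI and Unemployment are statistically significant here, while Temperature and Fuel_Price are not. This suggests that factors outside these economic variables, such as differences between stores, matter far more.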

6.3 Q: How can we visualize the Actual vs Predicted Sales from the regression model?

data_clean$Predicted <- predict(model2)

ggplot(data_clean, aes(x = Predicted, y = Weekly_Sales)) +
  geom_point(color = "yellow", alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(title = "Actual vs Predicted Weekly Sales",
       x = "Predicted Sales",
       y = "Actual Sales")

Interpretation: The red line shows where perfect predictions would fall — since most yellow dots are far from it, our model is not very accurate at predicting exact weekly sales for individual stores.

6.4 Q: How can we predict Weekly Sales for a new set of economic conditions?

new_data <- data.frame(
  Temperature  = 70,
  Fuel_Price   = 3,
  CPI          = 220,
  Unemployment = 7
)

predict(model2, newdata = new_data)

##       1 
## 1001020

Interpretation: We gave the model some economic conditions and it predicted about $1 million in weekly sales — which is close to the average, showing the model tends to predict near the mean rather than capturing store specific performance.
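To see where this number comes from, the prediction can be reproduced by hand as the intercept plus each coefficient times its input. The sketch below copies the coefficients as printed in the model summary, so it matches predict() only up to rounding. The names coefs, x, and manual_pred are introduced for illustration.

```r
# Coefficients copied from the summary(model2) output above (rounded as printed)
coefs <- c(Intercept = 1651066.0, Temperature = -318.3, Fuel_Price = -4817.1,
           CPI = -1524.2, Unemployment = -39711.8)

# Same conditions as new_data, with a leading 1 for the intercept
x <- c(1, 70, 3, 220, 7)

# Linear prediction: sum of coefficient times input
manual_pred <- sum(coefs * x)
manual_pred
```

The largest contributions come from CPI and Unemployment, which is consistent with those two being the only significant predictors in the summary.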

6.5 Q: How can we improve the regression model by including Store as a predictor?

model3 <- lm(Weekly_Sales ~ Temperature + Fuel_Price + CPI + Unemployment + factor(Store),
             data = data_clean)
summary(model3)$r.squared

## [1] 0.9373765

summary(model3)$adj.r.squared

## [1] 0.9369033

Interpretation: Including Store as a predictor significantly improves the regression model, showing that store-specific factors strongly influence Weekly Sales. The higher R-squared values indicate much better predictive power compared to using economic variables alone.
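Adjusted R-squared penalizes the extra predictors added by factor(Store), so it is worth checking that the improvement survives the penalty. The sketch below applies the standard formula, assuming all 45 stores remain in data_clean so that factor(Store) contributes 44 dummy variables. The helper name adj_r2 is hypothetical.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 6401    # rows in data_clean
p <- 4 + 44  # 4 economic slopes plus 44 store dummies (assumed: 45 stores, one baseline)

adj_value <- adj_r2(0.9373765, n, p)
adj_value
```

The result agrees with the adjusted value reported above to within rounding, so the penalty for 48 predictors is negligible next to the gain in fit.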

6.6 Q: Show how much the model improved by comparing actual vs predicted values

data_clean$Predicted_New <- predict(model3)
ggplot(data_clean, aes(x = Predicted_New, y = Weekly_Sales)) +
  geom_point(color = "green", alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(title = "Actual vs Predicted Weekly Sales (Improved Model)",
       x = "Predicted Sales",
       y = "Actual Sales")

Interpretation: The earlier model did not know which store it was predicting for. Once store identity is included, R-squared rises from about 0.02 to about 0.94. This indicates that differences between stores, which may reflect factors such as location and size, matter far more than the economic variables in predicting weekly sales.

6.7 Q: Show the difference in R² between the two models

cat("Weak Model R²    :", round(summary(model2)$r.squared, 4), "\n")

## Weak Model R²    : 0.0223

cat("Improved Model R²:", round(summary(model3)$r.squared, 4), "\n")

## Improved Model R²: 0.9374

Interpretation: The improved model explains about 93.7% of the variation in weekly sales compared to just 2.2% before, a clear and substantial improvement from adding store information alone.

7 Conclusion

The Walmart Sales Analysis project showed that weekly sales are driven mainly by differences between stores, with smaller contributions from seasonal and holiday effects and from economic conditions such as CPI and unemployment. Data cleaning and visualization made these patterns easier to see. Regression on the economic variables alone explained only about 2% of the variation in weekly sales, while adding store identity raised this to about 94%. Overall, the project demonstrated how R can effectively convert raw sales data into meaningful business intelligence.