This report analyzes the Walmart Sales dataset using R to understand sales patterns, distribution, and factors influencing performance. Data analysis techniques such as summary statistics, visualization, and grouping are applied.
library(readr)
data <- read_csv("Downloads/Walmart.csv")
## Rows: 6435 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (7): Store, Weekly_Sales, Holiday_Flag, Temperature, Fuel_Price, CPI, Un...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## spc_tbl_ [6,435 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Store : num [1:6435] 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr [1:6435] "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
## $ Weekly_Sales: num [1:6435] 1643691 1641957 1611968 1409728 1554807 ...
## $ Holiday_Flag: num [1:6435] 0 1 0 0 0 0 0 0 0 0 ...
## $ Temperature : num [1:6435] 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num [1:6435] 2.57 2.55 2.51 2.56 2.62 ...
## $ CPI : num [1:6435] 211 211 211 211 211 ...
## $ Unemployment: num [1:6435] 8.11 8.11 8.11 8.11 8.11 ...
## - attr(*, "spec")=
## .. cols(
## .. Store = col_double(),
## .. Date = col_character(),
## .. Weekly_Sales = col_double(),
## .. Holiday_Flag = col_double(),
## .. Temperature = col_double(),
## .. Fuel_Price = col_double(),
## .. CPI = col_double(),
## .. Unemployment = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Interpretation:
The dataset contains appropriate variable types apart from the Date
variable, which is stored as character and needs conversion to Date
before any time-based analysis.
data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
str(data)
## spc_tbl_ [6,435 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Store : num [1:6435] 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date[1:6435], format: "2010-02-05" "2010-02-12" ...
## $ Weekly_Sales: num [1:6435] 1643691 1641957 1611968 1409728 1554807 ...
## $ Holiday_Flag: num [1:6435] 0 1 0 0 0 0 0 0 0 0 ...
## $ Temperature : num [1:6435] 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num [1:6435] 2.57 2.55 2.51 2.56 2.62 ...
## $ CPI : num [1:6435] 211 211 211 211 211 ...
## $ Unemployment: num [1:6435] 8.11 8.11 8.11 8.11 8.11 ...
## - attr(*, "spec")=
## .. cols(
## .. Store = col_double(),
## .. Date = col_character(),
## .. Weekly_Sales = col_double(),
## .. Holiday_Flag = col_double(),
## .. Temperature = col_double(),
## .. Fuel_Price = col_double(),
## .. CPI = col_double(),
## .. Unemployment = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Interpretation:
The Date variable has been successfully converted into date format,
making it suitable for time-based analysis.
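A quick check that the conversion did not silently produce missing values can be useful, because as.Date() returns NA for any string that does not match the given format. The sketch below uses four date strings from the str() output above plus one deliberately malformed value, which is hypothetical and added only for illustration.

```r
# Date strings copied from the str() output above, plus one deliberately
# malformed value (hypothetical) to show what a parsing failure looks like.
raw_dates <- c("05-02-2010", "12-02-2010", "19-02-2010", "26-02-2010", "2010/03/05")

parsed <- as.Date(raw_dates, format = "%d-%m-%Y")

# as.Date() yields NA when a string does not match the format
n_failed <- sum(is.na(parsed))
print(n_failed)

# Show the offending raw values so they can be inspected
print(raw_dates[is.na(parsed)])
```

On the full dataset the equivalent check would be sum(is.na(data$Date)) after the conversion.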
summary(data$Weekly_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 209986 553350 960746 1046965 1420159 3818686
Interpretation: Weekly Sales range from 209,986 to 3,818,686, roughly an 18-fold spread. The mean (1,046,965) is higher than the median (960,746), which suggests a right-skewed distribution.
hist(data$Weekly_Sales)
Interpretation:
Most sales values are concentrated in the lower range, while a smaller
number of high values stretch the distribution to the right.
boxplot(data$Weekly_Sales)
Interpretation:
The boxplot shows the presence of extreme values, indicating that some
sales values are significantly higher than others.
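The extreme values can be made concrete with the usual 1.5 × IQR rule. The sketch below works only from the quartiles printed by summary() above; boxplot() itself uses hinges from fivenum(), which may differ slightly from these quartiles, so the fence should be read as approximate.

```r
# Quartiles and maximum copied from the summary(data$Weekly_Sales) output above
q1 <- 553350
q3 <- 1420159
max_sales <- 3818686

iqr <- q3 - q1                  # width of the middle 50% of weekly sales
upper_fence <- q3 + 1.5 * iqr   # values above this count as outliers under the rule

print(iqr)
print(upper_fence)
print(max_sales > upper_fence)  # TRUE means the maximum lies beyond the upper fence
```

On the full dataset, boxplot.stats(data$Weekly_Sales)$out would list the individual points drawn beyond the whiskers.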
data$Sales_Category <- ifelse(data$Weekly_Sales > mean(data$Weekly_Sales), "High", "Low")
table(data$Sales_Category)
##
## High Low
## 2876 3559
Interpretation:
Weekly Sales are split into High and Low categories relative to the
overall mean. Fewer than half of the observations (2,876 of 6,435) are
above the mean, which is consistent with a right-skewed distribution.
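The category counts are easier to compare as percentages. A minimal sketch, using the two counts from the table() output above:

```r
# Counts copied from the table(data$Sales_Category) output above
counts <- c(High = 2876, Low = 3559)

# prop.table() divides each count by the total; scale to percent and round
shares <- round(100 * prop.table(counts), 1)
print(shares)
```

On the full dataset the same result would come from prop.table(table(data$Sales_Category)).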
high_sales <- subset(data, Weekly_Sales > mean(data$Weekly_Sales))
head(high_sales)
## # A tibble: 6 × 9
## Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2010-02-05 1643691. 0 42.3 2.57 211.
## 2 1 2010-02-12 1641957. 1 38.5 2.55 211.
## 3 1 2010-02-19 1611968. 0 39.9 2.51 211.
## 4 1 2010-02-26 1409728. 0 46.6 2.56 211.
## 5 1 2010-03-05 1554807. 0 46.5 2.62 211.
## 6 1 2010-03-12 1439542. 0 57.8 2.67 211.
## # ℹ 2 more variables: Unemployment <dbl>, Sales_Category <chr>
Interpretation:
The filtered records are observations where Weekly Sales are above
average. head() shows only the first six rows in file order, so all of
them belong to Store 1.
aggregate(Weekly_Sales ~ Store, data = data, mean)
## Store Weekly_Sales
## 1 1 1555264.4
## 2 2 1925751.3
## 3 3 402704.4
## 4 4 2094713.0
## 5 5 318011.8
## 6 6 1564728.2
## 7 7 570617.3
## 8 8 908749.5
## 9 9 543980.6
## 10 10 1899424.6
## 11 11 1356383.1
## 12 12 1009001.6
## 13 13 2003620.3
## 14 14 2020978.4
## 15 15 623312.5
## 16 16 519247.7
## 17 17 893581.4
## 18 18 1084718.4
## 19 19 1444999.0
## 20 20 2107676.9
## 21 21 756069.1
## 22 22 1028501.0
## 23 23 1389864.5
## 24 24 1356755.4
## 25 25 706721.5
## 26 26 1002911.8
## 27 27 1775216.2
## 28 28 1323522.2
## 29 29 539451.4
## 30 30 438579.6
## 31 31 1395901.4
## 32 32 1166568.2
## 33 33 259861.7
## 34 34 966781.6
## 35 35 919725.0
## 36 36 373512.0
## 37 37 518900.3
## 38 38 385731.7
## 39 39 1450668.1
## 40 40 964128.0
## 41 41 1268125.4
## 42 42 556403.9
## 43 43 633324.7
## 44 44 302748.9
## 45 45 785981.4
Interpretation:
Average Weekly Sales differ widely across stores, from about 259,862
(Store 33) to about 2,107,677 (Store 20), roughly an eightfold
difference in store performance.
avg_sales <- aggregate(Weekly_Sales ~ Store, data = data, mean)
top_stores <- avg_sales[order(-avg_sales$Weekly_Sales), ]
head(top_stores)
## Store Weekly_Sales
## 20 20 2107677
## 4 4 2094713
## 14 14 2020978
## 13 13 2003620
## 2 2 1925751
## 10 10 1899425
Interpretation:
The stores are sorted by average Weekly Sales in descending order;
Stores 20, 4, 14, 13, 2, and 10 are the six best-performing locations.
aggregate(Weekly_Sales ~ Holiday_Flag, data = data, mean)
## Holiday_Flag Weekly_Sales
## 1 0 1041256
## 2 1 1122888
Interpretation:
Average Weekly Sales are about 7.8% higher in holiday weeks (1,122,888)
than in non-holiday weeks (1,041,256), suggesting that holidays are
associated with higher sales.
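A difference in group means does not by itself show whether the gap is larger than week-to-week noise. One possible check is a Welch two-sample t-test, sketched below. The helper name compare_holiday_sales and the toy data frame are hypothetical and are not part of the original analysis; the toy values are made up only so the example runs on its own.

```r
# Hypothetical helper: Welch two-sample t-test of Weekly_Sales by Holiday_Flag.
# Assumes df has a numeric Weekly_Sales column and a two-level Holiday_Flag column.
compare_holiday_sales <- function(df) {
  t.test(Weekly_Sales ~ Holiday_Flag, data = df)  # var.equal = FALSE by default
}

# Made-up toy data, only to show that the helper runs; not taken from Walmart.csv
toy <- data.frame(
  Weekly_Sales = c(100, 110, 105, 95, 130, 140),
  Holiday_Flag = c(0, 0, 0, 0, 1, 1)
)
res <- compare_holiday_sales(toy)
print(res$estimate)  # the two group means
print(res$p.value)

# Relative difference between the two group means reported above
pct_diff <- 100 * (1122888 - 1041256) / 1041256
print(round(pct_diff, 1))
```

On the full dataset the call would be compare_holiday_sales(data). Because weekly sales are right-skewed and stores differ widely in size, the result should be read as a rough indication rather than a definitive test.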
library(ggplot2)
ggplot(data, aes(x = factor(Holiday_Flag), y = Weekly_Sales)) +
  stat_summary(fun = mean, geom = "bar") +
  xlab("Holiday Flag (0 = No, 1 = Yes)") +
  ylab("Average Weekly Sales")
Interpretation:
The bar chart shows higher average sales in holiday weeks than in
non-holiday weeks. It displays group means only, not the spread within
each group.
boxplot(Weekly_Sales ~ Store, data = data)
Interpretation:
Weekly Sales vary across stores; the boxplot shows differences in both
the typical sales level and the spread of sales for each store.
table(data$Holiday_Flag)
##
## 0 1
## 5985 450
Interpretation:
Holiday weeks make up 450 of the 6,435 observations (about 7%), so the
two groups are highly unbalanced.
summary(data$Temperature)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.06 47.46 62.67 60.66 74.94 100.14
Interpretation:
Temperature varies widely across observations, from -2.06 to 100.14,
indicating changing environmental conditions in the dataset.
plot(data$Temperature, data$Weekly_Sales)
Interpretation:
The scatter plot shows no strong pattern between Temperature and Weekly
Sales, suggesting a weak relationship.
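The visual impression from the scatter plot can be backed by a correlation coefficient. The sketch below defines a hypothetical helper, sales_correlation, and runs it on the first five rows shown in the str() output above. Those rows come from a single store and are rounded, so the numbers they produce say nothing about the full dataset; they only demonstrate the call.

```r
# Hypothetical helper: Pearson and Spearman correlation between column x
# and Weekly_Sales, ignoring rows with missing values.
sales_correlation <- function(df, x) {
  c(
    pearson  = cor(df[[x]], df$Weekly_Sales, use = "complete.obs"),
    spearman = cor(df[[x]], df$Weekly_Sales, use = "complete.obs",
                   method = "spearman")
  )
}

# First five rows as displayed in the str() output above (Store 1 only, rounded)
first_rows <- data.frame(
  Temperature  = c(42.3, 38.5, 39.9, 46.6, 46.5),
  Weekly_Sales = c(1643691, 1641957, 1611968, 1409728, 1554807)
)
corr <- sales_correlation(first_rows, "Temperature")
print(corr)
```

On the full dataset the call would be sales_correlation(data, "Temperature"); the same helper also accepts "Fuel_Price", "CPI", or "Unemployment".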
summary(data$Fuel_Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.472 2.933 3.445 3.359 3.735 4.468
Interpretation:
Fuel Price ranges from 2.472 to 4.468 across observations, with a mean
of 3.359, indicating changes in fuel cost over the observed period.
top_sales <- data[order(-data$Weekly_Sales), ]
head(top_sales)
## # A tibble: 6 × 9
## Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14 2010-12-24 3818686. 0 30.6 3.14 183.
## 2 20 2010-12-24 3766687. 0 25.2 3.14 205.
## 3 10 2010-12-24 3749058. 0 57.1 3.24 127.
## 4 4 2011-12-23 3676389. 0 35.9 3.10 130.
## 5 13 2010-12-24 3595903. 0 34.9 2.85 127.
## 6 13 2011-12-23 3556766. 0 24.8 3.19 130.
## # ℹ 2 more variables: Unemployment <dbl>, Sales_Category <chr>
Interpretation:
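The six highest sales records all fall in the week before Christmas
(2010-12-24 and 2011-12-23), highlighting a seasonal peak. None of
these weeks is marked as a holiday (Holiday_Flag = 0).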
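Grouping by calendar month is one way to see when the peak records occur. The sketch below rebuilds the six rows from the head(top_sales) output above and extracts the month from each date; the Month column is an addition for illustration and is not part of the original data.

```r
# The six top records, copied from the head(top_sales) output above
peak <- data.frame(
  Date = as.Date(c("2010-12-24", "2010-12-24", "2010-12-24",
                   "2011-12-23", "2010-12-24", "2011-12-23")),
  Weekly_Sales = c(3818686, 3766687, 3749058, 3676389, 3595903, 3556766),
  Holiday_Flag = 0
)

# Extract the two-digit month from each date (hypothetical helper column)
peak$Month <- format(peak$Date, "%m")
print(table(peak$Month))

# Average sales per month; with these six rows there is a single group
monthly <- aggregate(Weekly_Sales ~ Month, data = peak, FUN = mean)
print(monthly)
```

Applied to the full dataset (adding data$Month in the same way and passing data to aggregate()), this would give average Weekly Sales for every month, which would show whether December stands out beyond these six records.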
boxplot(data$Weekly_Sales)
Interpretation:
Weekly Sales show variation with the presence of extreme values,
indicating uneven distribution.
The analysis shows that Weekly Sales vary substantially across stores and time. Average sales are higher in holiday weeks, while temperature shows no strong relationship with sales; fuel price was summarized but not compared against sales. The dataset also contains extreme values, the largest of which occur in the week before Christmas, indicating an uneven sales distribution.