1 Introduction

This report analyzes the Walmart Sales dataset using R to understand sales patterns, the distribution of sales, and the factors that influence performance. It applies several data analysis techniques, including summary statistics, visualization, and grouping.

library(readr)
data <- read_csv("Downloads/Walmart.csv")
## Rows: 6435 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (7): Store, Weekly_Sales, Holiday_Flag, Temperature, Fuel_Price, CPI, Un...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2 Analysis

2.1 Q1: Are the variables in the dataset correctly structured for analysis?

str(data)
## spc_tbl_ [6,435 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Store       : num [1:6435] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr [1:6435] "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
##  $ Weekly_Sales: num [1:6435] 1643691 1641957 1611968 1409728 1554807 ...
##  $ Holiday_Flag: num [1:6435] 0 1 0 0 0 0 0 0 0 0 ...
##  $ Temperature : num [1:6435] 42.3 38.5 39.9 46.6 46.5 ...
##  $ Fuel_Price  : num [1:6435] 2.57 2.55 2.51 2.56 2.62 ...
##  $ CPI         : num [1:6435] 211 211 211 211 211 ...
##  $ Unemployment: num [1:6435] 8.11 8.11 8.11 8.11 8.11 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Store = col_double(),
##   ..   Date = col_character(),
##   ..   Weekly_Sales = col_double(),
##   ..   Holiday_Flag = col_double(),
##   ..   Temperature = col_double(),
##   ..   Fuel_Price = col_double(),
##   ..   CPI = col_double(),
##   ..   Unemployment = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Interpretation:
The variables have appropriate types, with one exception: Date is stored as character and should be converted to a date type before any time-based analysis.
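
A quick way to confirm the column types without reading the full str() output is to pull out the class of each column. The sketch below uses a small hypothetical data frame, `toy`, built from the first rows shown above; it stands in for the real `data` object and is only an illustration.

```r
# Hypothetical three-row stand-in for the Walmart data
# (values copied from the str() output in this report).
toy <- data.frame(
  Store = c(1, 1, 1),
  Date = c("05-02-2010", "12-02-2010", "19-02-2010"),
  Weekly_Sales = c(1643691, 1641957, 1611968),
  Holiday_Flag = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# First class of every column, as a named character vector.
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
print(col_classes)

# Columns still stored as text, which may need conversion.
needs_conversion <- names(col_classes)[col_classes == "character"]
print(needs_conversion)
```

On the real dataset the same two lines could be run on `data` instead of `toy`.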

2.2 Q2: Can the Date variable be converted into a proper date format for analysis? If yes, how?

data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
str(data)
## spc_tbl_ [6,435 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Store       : num [1:6435] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : Date[1:6435], format: "2010-02-05" "2010-02-12" ...
##  $ Weekly_Sales: num [1:6435] 1643691 1641957 1611968 1409728 1554807 ...
##  $ Holiday_Flag: num [1:6435] 0 1 0 0 0 0 0 0 0 0 ...
##  $ Temperature : num [1:6435] 42.3 38.5 39.9 46.6 46.5 ...
##  $ Fuel_Price  : num [1:6435] 2.57 2.55 2.51 2.56 2.62 ...
##  $ CPI         : num [1:6435] 211 211 211 211 211 ...
##  $ Unemployment: num [1:6435] 8.11 8.11 8.11 8.11 8.11 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Store = col_double(),
##   ..   Date = col_character(),
##   ..   Weekly_Sales = col_double(),
##   ..   Holiday_Flag = col_double(),
##   ..   Temperature = col_double(),
##   ..   Fuel_Price = col_double(),
##   ..   CPI = col_double(),
##   ..   Unemployment = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Interpretation:
The Date variable has been successfully converted into date format, making it suitable for time-based analysis.
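
One thing worth checking after a conversion like this is whether any value failed to parse, because as.Date() silently returns NA for strings that do not match the format. The sketch below uses a hypothetical vector, `raw_dates`, with one deliberately malformed entry; it is not taken from the dataset.

```r
# Hypothetical input: two valid day-month-year strings and one malformed entry.
raw_dates <- c("05-02-2010", "12-02-2010", "not-a-date")

# as.Date() returns NA for strings that do not match the format.
parsed <- as.Date(raw_dates, format = "%d-%m-%Y")

# Count and show the entries that failed to parse.
n_failed <- sum(is.na(parsed))
print(n_failed)
print(raw_dates[is.na(parsed)])

# Once parsed, date parts can be extracted for grouping.
print(format(parsed[1], "%Y"))  # year as text
print(months(parsed[1]))        # month name (depends on the locale)
```

Running `sum(is.na(data$Date))` after the conversion above would give the same check on the real data.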

2.3 Q3: What is the overall statistical summary of Weekly Sales?

summary(data$Weekly_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  209986  553350  960746 1046965 1420159 3818686

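Interpretation:
Weekly Sales range from 209,986 to 3,818,686, so sales are far from consistent across the dataset. The mean (1,046,965) is higher than the median (960,746), which suggests that a small number of very high values pull the average upward.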

2.4 Q4: What is the distribution pattern of Weekly Sales?

hist(data$Weekly_Sales)

Interpretation:
Most sales values are concentrated in the lower range, while a few high values stretch the distribution to the right (a right-skewed shape).
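
The default histogram can be made easier to read with a title, an axis label, and a finer bin count. The sketch below is run on simulated right-skewed values (the `sales` vector is hypothetical, generated with rlnorm()), not on the Walmart data, so the shape is only indicative.

```r
# Hypothetical right-skewed sample standing in for data$Weekly_Sales.
set.seed(1)
sales <- rlnorm(500, meanlog = 13.7, sdlog = 0.6)

# Labelled histogram with more bins than the default.
# hist() also returns the bin counts, which are stored in h.
h <- hist(
  sales,
  breaks = 30,
  main = "Distribution of Weekly Sales (simulated)",
  xlab = "Weekly Sales",
  col = "grey80"
)

# In a right-skewed sample the mean usually sits above the median.
print(mean(sales) > median(sales))
```

The `breaks` argument is treated by hist() as a suggestion, so the number of bins drawn may differ slightly from 30.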

2.5 Q5: Are there any extreme values present in Weekly Sales?

boxplot(data$Weekly_Sales)

Interpretation:
The boxplot shows the presence of extreme values, indicating that some sales values are significantly higher than others.
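
By default, boxplot() draws a point individually when it lies more than 1.5 times the interquartile range beyond the box. The sketch below works out approximate cut-offs from the quartiles reported by summary() earlier in this report. It is an approximation, because boxplot() uses hinges, which can differ slightly from the quartiles that summary() prints.

```r
# Quartiles as reported by summary(data$Weekly_Sales) in this report.
q1 <- 553350
q3 <- 1420159

iqr <- q3 - q1                   # interquartile range
upper_fence <- q3 + 1.5 * iqr    # values above this are drawn as points
lower_fence <- q1 - 1.5 * iqr    # values below this are drawn as points

print(c(lower = lower_fence, upper = upper_fence))
```

The lower cut-off is negative, which is below the minimum of 209,986, so under this rule only high values can be flagged. The flagged values themselves can be listed with `boxplot.stats(data$Weekly_Sales)$out`.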

2.6 Q6: How can Weekly Sales be categorized into High and Low performance?

data$Sales_Category <- ifelse(data$Weekly_Sales > mean(data$Weekly_Sales), "High", "Low")
table(data$Sales_Category)
## 
## High  Low 
## 2876 3559

Interpretation:
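Weekly Sales are divided into High and Low categories using the overall mean as the cut-off, which simplifies performance analysis. The Low group (3,559 records) is larger than the High group (2,876 records), as expected when a few very high values raise the mean.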
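
The choice of cut-off affects how many records land in each group. The sketch below compares a mean-based split with a median-based split on a small hypothetical vector, `sales`, chosen so that one large value pulls the mean upward. The median split is an alternative for comparison, not the method used in this report.

```r
# Hypothetical skewed sample: one large value pulls the mean upward.
sales <- c(200, 400, 600, 800, 3000)

# Split used in this report: above the mean is "High".
by_mean <- ifelse(sales > mean(sales), "High", "Low")

# Alternative split: above the median is "High".
by_median <- ifelse(sales > median(sales), "High", "Low")

print(table(by_mean))
print(table(by_median))
```

With skewed data the median split gives groups of more similar size, while the mean split keeps the High label for fewer, larger values.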

2.7 Q7: Which records show high sales performance above average?

high_sales <- subset(data, Weekly_Sales > mean(data$Weekly_Sales))
head(high_sales)
## # A tibble: 6 × 9
##   Store Date       Weekly_Sales Holiday_Flag Temperature Fuel_Price   CPI
##   <dbl> <date>            <dbl>        <dbl>       <dbl>      <dbl> <dbl>
## 1     1 2010-02-05     1643691.            0        42.3       2.57  211.
## 2     1 2010-02-12     1641957.            1        38.5       2.55  211.
## 3     1 2010-02-19     1611968.            0        39.9       2.51  211.
## 4     1 2010-02-26     1409728.            0        46.6       2.56  211.
## 5     1 2010-03-05     1554807.            0        46.5       2.62  211.
## 6     1 2010-03-12     1439542.            0        57.8       2.67  211.
## # ℹ 2 more variables: Unemployment <dbl>, Sales_Category <chr>

Interpretation:
The filtered records are the observations where Weekly Sales are above the overall average. Note that head() returns the first six rows in file order, all from Store 1 here, rather than the six highest values.
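
A related question is how the above-average weeks are spread across stores. The sketch below computes, for each store, the share of weeks above the overall mean. The `toy` data frame is hypothetical and only shows the shape of the calculation.

```r
# Hypothetical data: two stores with two weeks each.
toy <- data.frame(
  Store = c(1, 1, 2, 2),
  Weekly_Sales = c(10, 30, 1, 3)
)

# TRUE where a week is above the overall mean.
above <- toy$Weekly_Sales > mean(toy$Weekly_Sales)

# The mean of a logical vector is the share of TRUE values.
share_above <- tapply(above, toy$Store, mean)
print(share_above)
```

A share close to 1 would mean a store is almost always above average, and a share close to 0 the opposite.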

2.8 Q8: How does the average Weekly Sales vary across different stores?

aggregate(Weekly_Sales ~ Store, data = data, mean)
##    Store Weekly_Sales
## 1      1    1555264.4
## 2      2    1925751.3
## 3      3     402704.4
## 4      4    2094713.0
## 5      5     318011.8
## 6      6    1564728.2
## 7      7     570617.3
## 8      8     908749.5
## 9      9     543980.6
## 10    10    1899424.6
## 11    11    1356383.1
## 12    12    1009001.6
## 13    13    2003620.3
## 14    14    2020978.4
## 15    15     623312.5
## 16    16     519247.7
## 17    17     893581.4
## 18    18    1084718.4
## 19    19    1444999.0
## 20    20    2107676.9
## 21    21     756069.1
## 22    22    1028501.0
## 23    23    1389864.5
## 24    24    1356755.4
## 25    25     706721.5
## 26    26    1002911.8
## 27    27    1775216.2
## 28    28    1323522.2
## 29    29     539451.4
## 30    30     438579.6
## 31    31    1395901.4
## 32    32    1166568.2
## 33    33     259861.7
## 34    34     966781.6
## 35    35     919725.0
## 36    36     373512.0
## 37    37     518900.3
## 38    38     385731.7
## 39    39    1450668.1
## 40    40     964128.0
## 41    41    1268125.4
## 42    42     556403.9
## 43    43     633324.7
## 44    44     302748.9
## 45    45     785981.4

Interpretation:
Average Weekly Sales differ widely across stores, from about 259,862 (Store 33) to about 2,107,677 (Store 20), a roughly eightfold gap that indicates large variation in store performance.
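
The mean alone does not show how stable each store's sales are. The sketch below adds the standard deviation per store next to the mean, using tapply(). The `toy` data frame and the `by_store` column names are hypothetical.

```r
# Hypothetical data: two stores with three weeks each.
toy <- data.frame(
  Store = c(1, 1, 1, 2, 2, 2),
  Weekly_Sales = c(100, 110, 120, 10, 50, 90)
)

# tapply() returns one value per store, in sorted store order.
by_store <- data.frame(
  Store = sort(unique(toy$Store)),
  mean_sales = as.vector(tapply(toy$Weekly_Sales, toy$Store, mean)),
  sd_sales = as.vector(tapply(toy$Weekly_Sales, toy$Store, sd))
)
print(by_store)
```

In this toy example the second store has the lower mean but the larger spread, which the mean-only table would not reveal.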

2.9 Q9: Which stores have the highest average Weekly Sales?

avg_sales <- aggregate(Weekly_Sales ~ Store, data = data, mean)
top_stores <- avg_sales[order(-avg_sales$Weekly_Sales), ]
head(top_stores)
##    Store Weekly_Sales
## 20    20      2107677
## 4      4      2094713
## 14    14      2020978
## 13    13      2003620
## 2      2      1925751
## 10    10      1899425

Interpretation:
Sorting the store averages in descending order identifies the best-performing locations. Stores 20, 4, 14, and 13 each average more than 2 million in Weekly Sales.

2.10 Q10: How does the presence of holidays affect Weekly Sales?

aggregate(Weekly_Sales ~ Holiday_Flag, data = data, mean)
##   Holiday_Flag Weekly_Sales
## 1            0      1041256
## 2            1      1122888

Interpretation:
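Average Weekly Sales are higher in holiday weeks (1,122,888) than in non-holiday weeks (1,041,256), a difference of about 8%. This suggests that holidays are associated with higher sales, although this comparison of means does not test whether the difference is statistically significant.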
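
The size of the gap can be expressed as a percentage, and a two-sample test is one possible way to check whether such a gap could be due to chance. The sketch below computes the percentage from the two means reported above, then shows the shape of a Welch t-test call on a hypothetical `toy` data frame. The test result on the toy data says nothing about the Walmart data, and weekly observations from the same stores are not independent, so any such test would be a rough check only.

```r
# Group means as reported by the aggregate() output in this report.
mean_non_holiday <- 1041256
mean_holiday <- 1122888

# Percentage by which holiday weeks exceed non-holiday weeks.
pct_diff <- 100 * (mean_holiday - mean_non_holiday) / mean_non_holiday
print(round(pct_diff, 1))

# Hypothetical toy data to show the shape of a Welch two-sample t-test call.
toy <- data.frame(
  Weekly_Sales = c(100, 110, 90, 105, 95, 120, 130, 125),
  Holiday_Flag = c(0, 0, 0, 0, 0, 1, 1, 1)
)
res <- t.test(Weekly_Sales ~ Holiday_Flag, data = toy)
print(res$p.value)
```

On the real data the same call would be `t.test(Weekly_Sales ~ Holiday_Flag, data = data)`.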

2.11 Q11: How can the difference in Weekly Sales between holiday and non-holiday periods be visualized?

library(ggplot2)
ggplot(data, aes(x = factor(Holiday_Flag), y = Weekly_Sales)) +
  stat_summary(fun = mean, geom = "bar") +
  xlab("Holiday Flag (0 = No, 1 = Yes)") +
  ylab("Average Weekly Sales")

Interpretation:
The bar chart shows that average sales are higher in holiday weeks than in non-holiday weeks.

2.12 Q12: How does Weekly Sales vary across different stores using visualization?

boxplot(Weekly_Sales ~ Store, data = data)

Interpretation:
The boxplots show that both the typical level and the spread of Weekly Sales differ from store to store, reflecting differences in performance.
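
With 45 stores on one axis, the plot can be easier to read when the stores are sorted by their median sales. The sketch below does this with reorder() on a hypothetical `toy` data frame; the column name `Store_by_median` is made up for the example.

```r
# Hypothetical data: three stores with two weeks each.
toy <- data.frame(
  Store = c(1, 1, 2, 2, 3, 3),
  Weekly_Sales = c(50, 60, 10, 20, 30, 40)
)

# reorder() turns Store into a factor whose levels follow
# each store's median Weekly_Sales, from lowest to highest.
toy$Store_by_median <- reorder(toy$Store, toy$Weekly_Sales, FUN = median)
print(levels(toy$Store_by_median))

# Boxplot with stores drawn in that order.
boxplot(Weekly_Sales ~ Store_by_median, data = toy,
        xlab = "Store (ordered by median sales)", ylab = "Weekly Sales")
```

The store labels stay the same; only their position on the axis changes.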

2.13 Q13: What is the frequency distribution of holiday and non-holiday periods?

table(data$Holiday_Flag)
## 
##    0    1 
## 5985  450

Interpretation:
Holiday weeks are a small minority: 450 of the 6,435 records (about 7%) are flagged as holidays, while 5,985 are not. Averages for the holiday group are therefore based on far fewer observations.
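
The counts can be turned into shares with prop.table(). The sketch below rebuilds a flag vector from the two counts reported above, since the raw data is not included here; the name `flags` is only for the example.

```r
# Flag vector rebuilt from the counts in the table() output above.
flags <- rep(c(0, 1), times = c(5985, 450))

# Share of each flag value, as a percentage.
share <- prop.table(table(flags))
print(round(100 * share, 1))
```

On the real data the same result would come from `prop.table(table(data$Holiday_Flag))`.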

2.14 Q14: How does temperature vary across the dataset?

summary(data$Temperature)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -2.06   47.46   62.67   60.66   74.94  100.14

Interpretation:
Temperature varies widely across observations, from -2.06 to 100.14 with a median of 62.67, indicating that the dataset covers a broad range of environmental conditions.

2.15 Q15: What pattern is observed between Temperature and Weekly Sales?

plot(data$Temperature, data$Weekly_Sales)

Interpretation:
There is no strong pattern between temperature and Weekly Sales, indicating a weak relationship.
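
A scatter plot can be backed up with a correlation coefficient. The sketch below computes Pearson and Spearman correlations on simulated values in which the two variables are generated independently, so the numbers describe the simulation only, not the Walmart data. The vectors `temperature` and `weekly_sales` are hypothetical.

```r
# Hypothetical data: the two variables are generated independently.
set.seed(42)
temperature <- runif(200, min = 0, max = 100)
weekly_sales <- rnorm(200, mean = 1e6, sd = 5e5)

# Pearson measures linear association, Spearman measures monotonic association.
pearson <- cor(temperature, weekly_sales)
spearman <- cor(temperature, weekly_sales, method = "spearman")
print(c(pearson = pearson, spearman = spearman))
```

On the real data the call would be `cor(data$Temperature, data$Weekly_Sales)`; a value near zero would agree with the plot.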

2.16 Q16: How does Fuel Price vary across the dataset?

summary(data$Fuel_Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.472   2.933   3.445   3.359   3.735   4.468

Interpretation:
Fuel Price ranges from 2.472 to 4.468 across observations, with a median of 3.445. The summary alone does not show how the price moved over time or whether it relates to sales.
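
To see how fuel price moves over time, the values can be averaged by year once Date has been converted. The sketch below does this on a hypothetical `toy` data frame with made-up prices; the `Year` column is added only for the example.

```r
# Hypothetical data: two observations in each of two years.
toy <- data.frame(
  Date = as.Date(c("2010-02-05", "2010-06-04", "2011-02-04", "2011-06-03")),
  Fuel_Price = c(2.5, 2.7, 3.1, 3.7)
)

# Extract the year as text and average the price within each year.
toy$Year <- format(toy$Date, "%Y")
by_year <- aggregate(Fuel_Price ~ Year, data = toy, mean)
print(by_year)
```

The same pattern works for months by using "%m" or "%Y-%m" in format().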

2.17 Q17: Which records have the highest Weekly Sales values?

top_sales <- data[order(-data$Weekly_Sales), ]
head(top_sales)
## # A tibble: 6 × 9
##   Store Date       Weekly_Sales Holiday_Flag Temperature Fuel_Price   CPI
##   <dbl> <date>            <dbl>        <dbl>       <dbl>      <dbl> <dbl>
## 1    14 2010-12-24     3818686.            0        30.6       3.14  183.
## 2    20 2010-12-24     3766687.            0        25.2       3.14  205.
## 3    10 2010-12-24     3749058.            0        57.1       3.24  127.
## 4     4 2011-12-23     3676389.            0        35.9       3.10  130.
## 5    13 2010-12-24     3595903.            0        34.9       2.85  127.
## 6    13 2011-12-23     3556766.            0        24.8       3.19  130.
## # ℹ 2 more variables: Unemployment <dbl>, Sales_Category <chr>

Interpretation:
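The six highest sales records all fall in the week before Christmas (2010-12-24 or 2011-12-23), which marks a clear seasonal peak. None of these weeks is flagged as a holiday (Holiday_Flag is 0), so the holiday flag does not capture this peak.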
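
The timing of the peak records can be checked by extracting the month from their dates. The sketch below uses the six dates shown in the output above; the names `top_dates` and `top_months` are only for the example.

```r
# Dates of the six highest records, copied from the head(top_sales) output.
top_dates <- as.Date(c("2010-12-24", "2010-12-24", "2010-12-24",
                       "2011-12-23", "2010-12-24", "2011-12-23"))

# Month number of each record, then a count per month.
top_months <- format(top_dates, "%m")
print(table(top_months))
```

On the real data the same check could be run on `head(top_sales, 20)$Date` to see whether the pattern holds beyond the top six.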

2.18 Q18: How are Weekly Sales distributed across the dataset using boxplot?

boxplot(data$Weekly_Sales)

Interpretation:
Weekly Sales show variation with the presence of extreme values, indicating uneven distribution.

3 Conclusion

The analysis shows that Weekly Sales vary widely across stores and over time. Average sales are higher in holiday weeks than in non-holiday weeks, while temperature shows no strong relationship with sales. Fuel price was only summarized on its own, so its effect on sales was not tested here. The dataset also contains extreme high values, indicating a right-skewed sales distribution.