Source and topic of the data

https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales

TOPIC: SUPERMARKET SALES

Record of sales data in three different supermarkets

Let’s me introduce my dataset!

The crowds in supermarkets these days are noticeable because the demand is becoming more and more growing in developed countries with a large population and therefore consumption. Passionate about everything related to business and product marketing because I I already have a small online sales business that sells cosmetics, my interest focused on this theme.Thus during my work, I will evaluate how certain products behave according to their price in order to find which are the most coveted and also I will focus on the category of person who consumes them the most in households, namely the men and women. It is very important to do this analysis because it is similar to a small market study before locating what consumers appreciate and consume the most according to gender. For young entrepreneurs like me, this is very important as it could help to implement the type of high consumption products in case we want to enter the business field.But also to report on the competition because whoever says high consumption rate also says several producers and/or competing companies. So, during my work, I will manipulate different types of variables, namely categorical and quantitative. Some examples of categorical variables that I will use are named respectively Genre (which describes the gender of the consumer, an important element in the consumer market), Product Line (which names the different types of products), Branch (which corresponds to each supermarket class in three A,B,C). Also I will use some quantitative variables such as Unit price, gross income and many others. For more details on the dataset, you will find above the link which gives access to the detailed dataset (the source).

Load the libraries and import the dataset

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readr)
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)
supermarket_sales_Sheet1 <- read_csv("SuperMarket_Sales/supermarket_sales - Sheet1.csv")
## Rows: 1000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): Invoice ID, Branch, City, Customer type, Gender, Product line, Dat...
## dbl  (8): Unit price, Quantity, Tax 5%, Total, cogs, gross margin percentage...
## time (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(supermarket_sales_Sheet1)
## # A tibble: 6 × 17
##   `Invoice ID` Branch City  Custo…¹ Gender Produ…² Unit …³ Quant…⁴ Tax 5…⁵ Total
##   <chr>        <chr>  <chr> <chr>   <chr>  <chr>     <dbl>   <dbl>   <dbl> <dbl>
## 1 750-67-8428  A      Yang… Member  Female Health…    74.7       7   26.1  549. 
## 2 226-31-3081  C      Nayp… Normal  Female Electr…    15.3       5    3.82  80.2
## 3 631-41-3108  A      Yang… Normal  Male   Home a…    46.3       7   16.2  341. 
## 4 123-19-1176  A      Yang… Member  Male   Health…    58.2       8   23.3  489. 
## 5 373-73-7910  A      Yang… Normal  Male   Sports…    86.3       7   30.2  634. 
## 6 699-14-3026  C      Nayp… Normal  Male   Electr…    85.4       7   29.9  628. 
## # … with 7 more variables: Date <chr>, Time <time>, Payment <chr>, cogs <dbl>,
## #   `gross margin percentage` <dbl>, `gross income` <dbl>, Rating <dbl>, and
## #   abbreviated variable names ¹​`Customer type`, ²​`Product line`,
## #   ³​`Unit price`, ⁴​Quantity, ⁵​`Tax 5%`
supermarket_sales_Sheet1 <- read_csv("SuperMarket_Sales/supermarket_sales - Sheet1.csv")
## Rows: 1000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): Invoice ID, Branch, City, Customer type, Gender, Product line, Dat...
## dbl  (8): Unit price, Quantity, Tax 5%, Total, cogs, gross margin percentage...
## time (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smdata<-supermarket_sales_Sheet1 

Clean the dataset

check for N/A in a vector

It is important to look and summarize data to make sure that there is no missing data because some datasets have NA as an internal function, others do not.

anyNA(smdata)
## [1] FALSE
#When TRUE is detected there are NA values.Otherwise FALSE.
str(smdata)
## spc_tbl_ [1,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Invoice ID             : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr [1:1000] "A" "C" "A" "A" ...
##  $ City                   : chr [1:1000] "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
##  $ Customer type          : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr [1:1000] "Female" "Female" "Male" "Male" ...
##  $ Product line           : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit price             : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : num [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax 5%                 : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Total                  : num [1:1000] 549 80.2 340.5 489 634.4 ...
##  $ Date                   : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : 'hms' num [1:1000] 13:08:00 10:29:00 13:23:00 20:33:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ Payment                : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
##  $ gross margin percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
##  $ gross income           : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Invoice ID` = col_character(),
##   ..   Branch = col_character(),
##   ..   City = col_character(),
##   ..   `Customer type` = col_character(),
##   ..   Gender = col_character(),
##   ..   `Product line` = col_character(),
##   ..   `Unit price` = col_double(),
##   ..   Quantity = col_double(),
##   ..   `Tax 5%` = col_double(),
##   ..   Total = col_double(),
##   ..   Date = col_character(),
##   ..   Time = col_time(format = ""),
##   ..   Payment = col_character(),
##   ..   cogs = col_double(),
##   ..   `gross margin percentage` = col_double(),
##   ..   `gross income` = col_double(),
##   ..   Rating = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Summarize and View update vector

#There is no missing value in the data so we can actually move on.
smdata%>%summary()
##   Invoice ID           Branch              City           Customer type     
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Gender          Product line         Unit price       Quantity    
##  Length:1000        Length:1000        Min.   :10.08   Min.   : 1.00  
##  Class :character   Class :character   1st Qu.:32.88   1st Qu.: 3.00  
##  Mode  :character   Mode  :character   Median :55.23   Median : 5.00  
##                                        Mean   :55.67   Mean   : 5.51  
##                                        3rd Qu.:77.94   3rd Qu.: 8.00  
##                                        Max.   :99.96   Max.   :10.00  
##      Tax 5%            Total             Date               Time         
##  Min.   : 0.5085   Min.   :  10.68   Length:1000        Length:1000      
##  1st Qu.: 5.9249   1st Qu.: 124.42   Class :character   Class1:hms       
##  Median :12.0880   Median : 253.85   Mode  :character   Class2:difftime  
##  Mean   :15.3794   Mean   : 322.97                      Mode  :numeric   
##  3rd Qu.:22.4453   3rd Qu.: 471.35                                       
##  Max.   :49.6500   Max.   :1042.65                                       
##    Payment               cogs        gross margin percentage  gross income    
##  Length:1000        Min.   : 10.17   Min.   :4.762           Min.   : 0.5085  
##  Class :character   1st Qu.:118.50   1st Qu.:4.762           1st Qu.: 5.9249  
##  Mode  :character   Median :241.76   Median :4.762           Median :12.0880  
##                     Mean   :307.59   Mean   :4.762           Mean   :15.3794  
##                     3rd Qu.:448.90   3rd Qu.:4.762           3rd Qu.:22.4453  
##                     Max.   :993.00   Max.   :4.762           Max.   :49.6500  
##      Rating      
##  Min.   : 4.000  
##  1st Qu.: 5.500  
##  Median : 7.000  
##  Mean   : 6.973  
##  3rd Qu.: 8.500  
##  Max.   :10.000

Create one data visualization with this dataset

Step1: Filter the dataset

Knowing that we have several variables and that we are not going to use them all. We’ll start by filtering out the columns that aren’t needed.

smdata1 <- select(smdata, -`Time`, -`Tax 5%`, -`Invoice ID`, -`gross margin percentage`,-`cogs`)
str(smdata1)
## tibble [1,000 × 12] (S3: tbl_df/tbl/data.frame)
##  $ Branch       : chr [1:1000] "A" "C" "A" "A" ...
##  $ City         : chr [1:1000] "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
##  $ Customer type: chr [1:1000] "Member" "Normal" "Normal" "Member" ...
##  $ Gender       : chr [1:1000] "Female" "Female" "Male" "Male" ...
##  $ Product line : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit price   : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity     : num [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
##  $ Total        : num [1:1000] 549 80.2 340.5 489 634.4 ...
##  $ Date         : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Payment      : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ gross income : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating       : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...

info verification

smdata1 %>%
  summary()
##     Branch              City           Customer type         Gender         
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product line         Unit price       Quantity         Total        
##  Length:1000        Min.   :10.08   Min.   : 1.00   Min.   :  10.68  
##  Class :character   1st Qu.:32.88   1st Qu.: 3.00   1st Qu.: 124.42  
##  Mode  :character   Median :55.23   Median : 5.00   Median : 253.85  
##                     Mean   :55.67   Mean   : 5.51   Mean   : 322.97  
##                     3rd Qu.:77.94   3rd Qu.: 8.00   3rd Qu.: 471.35  
##                     Max.   :99.96   Max.   :10.00   Max.   :1042.65  
##      Date             Payment           gross income         Rating      
##  Length:1000        Length:1000        Min.   : 0.5085   Min.   : 4.000  
##  Class :character   Class :character   1st Qu.: 5.9249   1st Qu.: 5.500  
##  Mode  :character   Mode  :character   Median :12.0880   Median : 7.000  
##                                        Mean   :15.3794   Mean   : 6.973  
##                                        3rd Qu.:22.4453   3rd Qu.: 8.500  
##                                        Max.   :49.6500   Max.   :10.000
mean(smdata$'Unit price', na.rm = TRUE)
## [1] 55.67213
smdata1$ProductLine <- format(smdata1$'Product line', format = "%Y")

Plot 1

The first plot globally presents the overview of the different types of products sale on the supermarket

plot1 <- ggplot(smdata1, aes(x = ProductLine)) +
  labs(title = "Number of product line Purchases") + 
   geom_bar(mapping = aes(x= ProductLine, fill=ProductLine)) + scale_fill_brewer(palette = "Spectral") +
  theme_minimal() +
  xlab("Product Line") + 
  ylab("Count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
plot1

Plot 2

The second plot is going to present the different modes of payment using by the customers to see which one they use the most.

mode_of_payment <- ggplot(smdata1, aes(x = Payment)) +
  labs(title = "Mode of Payment") +
  geom_bar(mapping = aes(x= Payment, fill=Payment )) + scale_fill_brewer(palette = "CMRmap") +
   theme_minimal() +
  xlab("Payment Type") + 
  ylab("Count")
## Warning in pal_name(palette, type): Unknown palette CMRmap
mode_of_payment

We can see that “Ewallet” is the most used mode of payment followed by cash, and the least used mode of payment in the three stores is the “credit card”

Plot 3:

smdata1$ProductLine <- format(smdata1$'Product line', format = "%Y")
# Which gender has purchase the most according to the different types of product line?
smdata1 %>% 
  filter(ProductLine != "") %>%
  group_by(Gender,ProductLine) %>% 
  summarise(Sum=sum(Quantity))%>% 
  ggplot(aes(x=reorder(ProductLine,Sum),Sum, fill = Gender, label=Sum))+
  geom_bar(stat="identity",position="dodge")+
  xlab("Product Line")+
  coord_flip()+
  ggtitle("Purchases made according to Gender")
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.

In general, there is not a big difference between purchases made by men and women. However, based on our plot it seems that there are more purchases made by women than by men in all categories except health and beauty.Which is a little surprising to me.

Step 4

# Now, we may want to get a general overview of sales for each Branch A,B and C.
#So, we are going to calculate the average sales amount and the number of sales per branch.

smdata3=group_by(smdata1,Branch)
summarize(smdata3,avg_tot=mean(Total),Quantity=n())
## # A tibble: 3 × 3
##   Branch avg_tot Quantity
##   <chr>    <dbl>    <int>
## 1 A         312.      340
## 2 B         320.      332
## 3 C         337.      328

Branch C has both the highest total average amount and the least number of sales.

smdata1$Gross_income <- format(smdata1$'gross income', format = "%Y")

Quick try to check our previous result

smdata1%>%
  ggplot() + geom_boxplot(aes(y=Gross_income, group=Branch, fill=Branch)) +
  scale_fill_manual(values=c(" Dark Blue","Orange"," Dark Green")) +                    
  ggtitle("Comparison Between Branch and Gross income") +
  coord_flip()

There is not much difference in gross income by branches at an average level.Branch C is slightly higher income that A or B. As observed early when we calculated the avg_tot, through branch A has slightly higher sales than the other. C in Naypyitaw is the most profitable branch in term of gross income.

Step 5

Plot 5

For each sellers, customer satisfaction is essential because this is what ensures that, in the face of competition, each of the sellers can position themselves on the market as the best.

smdata1$Customer.type <- format(smdata1$'Customer type', format = "%Y")
# And one of the most effective ways to see if a customer is satisfied or not is to  see how he rates the services or products he has purchased.

# In my plot I will therefore evaluate this according to the category of customers (as member or normal customer).

smdata <- smdata1
smdata %>% 
  select(Customer.type, Rating) %>%
  group_by(Customer.type, Rating) %>%
  summarise(count_rating=n()) %>% 
  ggplot(aes(x=Rating, y=count_rating, color=Customer.type)) + 
  geom_smooth(method="loess", se=FALSE) +
  ggtitle("Line chart of ratings by type of customers")
## `summarise()` has grouped output by 'Customer.type'. You can override using the
## `.groups` argument.
## `geom_smooth()` using formula = 'y ~ x'

Based on our plot, it was observed that normal customers are more likely to give low ratings. On the other hand, members are more likely to give higher ratings. This could be because member consumers get better customer service and discounts. Then, therefore this increases their satisfaction.

Step 6

In this step 6, we are going to show a overview of how much customers are likely to spend per shopping by creating an histogram.

#The gender column is passed to the color parameter to see the distribution for males and females separately.The bins set the number of bins
ggplot(smdata)+
  geom_histogram(mapping=aes(x=Total, color=Gender),bins=15,
                 fill='white')+
 labs(title = "Histogram of total sales amount by gender") + 
  xlab("Total") +
  ylab("Count") +
  theme_minimal()

We can see that the distribution of the total sales amount is highly similar for males and females.

library("lubridate")
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Last step

In this step we are going to show the total sales according to the days of the week. # Change Date Format

smdata4 <- smdata1 %>%
  mutate(month=month(as.POSIXlt(Date, format="%m/%d/%Y")), day=day(as.POSIXlt(Date, format="%m/%d/%Y"))) #%>%
smdata4 <- smdata4[,-1]
smdata4$Date <- as.Date(smdata4$Date, "%m/%d/%y")
year(smdata4$Date) <- 2019

head(smdata4)
## # A tibble: 6 × 16
##   City   Custo…¹ Gender Produ…² Unit …³ Quant…⁴ Total Date       Payment gross…⁵
##   <chr>  <chr>   <chr>  <chr>     <dbl>   <dbl> <dbl> <date>     <chr>     <dbl>
## 1 Yangon Member  Female Health…    74.7       7 549.  2019-01-05 Ewallet   26.1 
## 2 Naypy… Normal  Female Electr…    15.3       5  80.2 2019-03-08 Cash       3.82
## 3 Yangon Normal  Male   Home a…    46.3       7 341.  2019-03-03 Credit…   16.2 
## 4 Yangon Member  Male   Health…    58.2       8 489.  2019-01-27 Ewallet   23.3 
## 5 Yangon Normal  Male   Sports…    86.3       7 634.  2019-02-08 Ewallet   30.2 
## 6 Naypy… Normal  Male   Electr…    85.4       7 628.  2019-03-25 Ewallet   29.9 
## # … with 6 more variables: Rating <dbl>, ProductLine <chr>, Gross_income <chr>,
## #   Customer.type <chr>, month <dbl>, day <int>, and abbreviated variable names
## #   ¹​`Customer type`, ²​`Product line`, ³​`Unit price`, ⁴​Quantity,
## #   ⁵​`gross income`
# Add column weekday
smdata6 <- mutate(smdata4, Weekday = wday(Date))

smdata6$Weekday <- factor(smdata6$Weekday, levels=c(1,2,3,4,5,6,7), 
                         labels=c("Sunday","Monday","Tuesday","Wednesday","Thursday",
                                  "Friday","Saturday"))
head(smdata6)
## # A tibble: 6 × 17
##   City   Custo…¹ Gender Produ…² Unit …³ Quant…⁴ Total Date       Payment gross…⁵
##   <chr>  <chr>   <chr>  <chr>     <dbl>   <dbl> <dbl> <date>     <chr>     <dbl>
## 1 Yangon Member  Female Health…    74.7       7 549.  2019-01-05 Ewallet   26.1 
## 2 Naypy… Normal  Female Electr…    15.3       5  80.2 2019-03-08 Cash       3.82
## 3 Yangon Normal  Male   Home a…    46.3       7 341.  2019-03-03 Credit…   16.2 
## 4 Yangon Member  Male   Health…    58.2       8 489.  2019-01-27 Ewallet   23.3 
## 5 Yangon Normal  Male   Sports…    86.3       7 634.  2019-02-08 Ewallet   30.2 
## 6 Naypy… Normal  Male   Electr…    85.4       7 628.  2019-03-25 Ewallet   29.9 
## # … with 7 more variables: Rating <dbl>, ProductLine <chr>, Gross_income <chr>,
## #   Customer.type <chr>, month <dbl>, day <int>, Weekday <fct>, and abbreviated
## #   variable names ¹​`Customer type`, ²​`Product line`, ³​`Unit price`, ⁴​Quantity,
## #   ⁵​`gross income`
ggplot(smdata6)+ 
  geom_bar(mapping = aes(x=Weekday, color=City), fill='white', 
           position='dodge')+
  ggtitle("Total Sales Per Week")+ 
  xlab("Weekday")+ 
  ylab("Total Sales Per Week")

This plot showed that the highest sales is on the Saturdays as most of the people would prefer to buy groceries during the weekend and when they are able to manage their time. On the contrary, the Mondays are likely to be the lowest in sales, probably because it is the beginning of the working week. Interestingly, we can see that Tuesday is the second highest sales in a week.

Background Research about the topic

https://www.statista.com/topics/1563/supermarkets-in-the-us/#dossierKeyfigures

In order to deepen my knowledge, I extended my research to have information on the American market. Indeed, as everywhere else, supermarkets represent one of the most important channels in the distribution sector, both for consumers who tend to prefer them to other options, and for the value they generate, some big chains are also among the top retailers. In 2021, for example according to Statistica, supermarket sales in the United States exceeded US$765 billion in 2021, more than double the amount generated in previous years. The majority of sales exploded in the food and beverage sector with around 63,000 stores. In terms of employment size in supermarkets, most are also doing well, with some having less than five employees and others more than 500 employees. However, one of the negatives that I took away from this article is that e-commerce in the total turnover of supermarkets in the food and beverage sector, despite its growth, remains low. Supermarkets continue to operate and focus on physical stores to generate the bulk of their revenue.

Being a consumer and member in several stores, this visualization meant a lot to me because as I said above it business domain is a field of activity that interests me and in which I wish to integrate myself as a seller (even if I already have a small business online). Through this visualization, especially the plot that highlights preferences by gender, I was surprised to find that there was not a huge difference between the purchases made by men and those made by women. The most notable difference was that between “Health and Beauty” products in which men are the largest consumer. Women are often given the label of consuming more beauty products, although this is not always the case. According to our plot, men spend more than women on health and beauty products. The other part that caught my attention was about consumption according days of the week. Even though it was for me the most difficult plot to complete, I found it interesting to see that the purchases according to the days of the week could vary so much. It is clear that weekend days are the days with the highest consumption rates because at those times consumers have more free time and are better able to manage their schedules,and shop but I was surprised to see that certain days of the week can also be classified as medium consumption days such as Tuesdays, for example. As a small online business owner, this particular article gave me some ideas for strategies based on which days I might introduce new products or possibly do promotions with a 50/100 chance that a customer visits my website.Regarding my work, what I really wanted to include is the comparison of branches according to their sales, but for my personal research I will do it and if possible share it with you.