https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales
Record of sales data in three different supermarkets
The crowds in supermarkets these days are noticeable because the demand is becoming more and more growing in developed countries with a large population and therefore consumption. Passionate about everything related to business and product marketing because I I already have a small online sales business that sells cosmetics, my interest focused on this theme.Thus during my work, I will evaluate how certain products behave according to their price in order to find which are the most coveted and also I will focus on the category of person who consumes them the most in households, namely the men and women. It is very important to do this analysis because it is similar to a small market study before locating what consumers appreciate and consume the most according to gender. For young entrepreneurs like me, this is very important as it could help to implement the type of high consumption products in case we want to enter the business field.But also to report on the competition because whoever says high consumption rate also says several producers and/or competing companies. So, during my work, I will manipulate different types of variables, namely categorical and quantitative. Some examples of categorical variables that I will use are named respectively Genre (which describes the gender of the consumer, an important element in the consumer market), Product Line (which names the different types of products), Branch (which corresponds to each supermarket class in three A,B,C). Also I will use some quantitative variables such as Unit price, gross income and many others. For more details on the dataset, you will find above the link which gives access to the detailed dataset (the source).
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
supermarket_sales_Sheet1 <- read_csv("SuperMarket_Sales/supermarket_sales - Sheet1.csv")
## Rows: 1000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Invoice ID, Branch, City, Customer type, Gender, Product line, Dat...
## dbl (8): Unit price, Quantity, Tax 5%, Total, cogs, gross margin percentage...
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(supermarket_sales_Sheet1)
## # A tibble: 6 × 17
## `Invoice ID` Branch City Custo…¹ Gender Produ…² Unit …³ Quant…⁴ Tax 5…⁵ Total
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 750-67-8428 A Yang… Member Female Health… 74.7 7 26.1 549.
## 2 226-31-3081 C Nayp… Normal Female Electr… 15.3 5 3.82 80.2
## 3 631-41-3108 A Yang… Normal Male Home a… 46.3 7 16.2 341.
## 4 123-19-1176 A Yang… Member Male Health… 58.2 8 23.3 489.
## 5 373-73-7910 A Yang… Normal Male Sports… 86.3 7 30.2 634.
## 6 699-14-3026 C Nayp… Normal Male Electr… 85.4 7 29.9 628.
## # … with 7 more variables: Date <chr>, Time <time>, Payment <chr>, cogs <dbl>,
## # `gross margin percentage` <dbl>, `gross income` <dbl>, Rating <dbl>, and
## # abbreviated variable names ¹`Customer type`, ²`Product line`,
## # ³`Unit price`, ⁴Quantity, ⁵`Tax 5%`
supermarket_sales_Sheet1 <- read_csv("SuperMarket_Sales/supermarket_sales - Sheet1.csv")
## Rows: 1000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Invoice ID, Branch, City, Customer type, Gender, Product line, Dat...
## dbl (8): Unit price, Quantity, Tax 5%, Total, cogs, gross margin percentage...
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smdata<-supermarket_sales_Sheet1
It is important to look and summarize data to make sure that there is no missing data because some datasets have NA as an internal function, others do not.
anyNA(smdata)
## [1] FALSE
#When TRUE is detected there are NA values.Otherwise FALSE.
str(smdata)
## spc_tbl_ [1,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Invoice ID : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr [1:1000] "A" "C" "A" "A" ...
## $ City : chr [1:1000] "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
## $ Customer type : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr [1:1000] "Female" "Female" "Male" "Male" ...
## $ Product line : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit price : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : num [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax 5% : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Total : num [1:1000] 549 80.2 340.5 489 634.4 ...
## $ Date : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : 'hms' num [1:1000] 13:08:00 10:29:00 13:23:00 20:33:00 ...
## ..- attr(*, "units")= chr "secs"
## $ Payment : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
## $ gross margin percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
## $ gross income : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## - attr(*, "spec")=
## .. cols(
## .. `Invoice ID` = col_character(),
## .. Branch = col_character(),
## .. City = col_character(),
## .. `Customer type` = col_character(),
## .. Gender = col_character(),
## .. `Product line` = col_character(),
## .. `Unit price` = col_double(),
## .. Quantity = col_double(),
## .. `Tax 5%` = col_double(),
## .. Total = col_double(),
## .. Date = col_character(),
## .. Time = col_time(format = ""),
## .. Payment = col_character(),
## .. cogs = col_double(),
## .. `gross margin percentage` = col_double(),
## .. `gross income` = col_double(),
## .. Rating = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
#There is no missing value in the data so we can actually move on.
smdata%>%summary()
## Invoice ID Branch City Customer type
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Gender Product line Unit price Quantity
## Length:1000 Length:1000 Min. :10.08 Min. : 1.00
## Class :character Class :character 1st Qu.:32.88 1st Qu.: 3.00
## Mode :character Mode :character Median :55.23 Median : 5.00
## Mean :55.67 Mean : 5.51
## 3rd Qu.:77.94 3rd Qu.: 8.00
## Max. :99.96 Max. :10.00
## Tax 5% Total Date Time
## Min. : 0.5085 Min. : 10.68 Length:1000 Length:1000
## 1st Qu.: 5.9249 1st Qu.: 124.42 Class :character Class1:hms
## Median :12.0880 Median : 253.85 Mode :character Class2:difftime
## Mean :15.3794 Mean : 322.97 Mode :numeric
## 3rd Qu.:22.4453 3rd Qu.: 471.35
## Max. :49.6500 Max. :1042.65
## Payment cogs gross margin percentage gross income
## Length:1000 Min. : 10.17 Min. :4.762 Min. : 0.5085
## Class :character 1st Qu.:118.50 1st Qu.:4.762 1st Qu.: 5.9249
## Mode :character Median :241.76 Median :4.762 Median :12.0880
## Mean :307.59 Mean :4.762 Mean :15.3794
## 3rd Qu.:448.90 3rd Qu.:4.762 3rd Qu.:22.4453
## Max. :993.00 Max. :4.762 Max. :49.6500
## Rating
## Min. : 4.000
## 1st Qu.: 5.500
## Median : 7.000
## Mean : 6.973
## 3rd Qu.: 8.500
## Max. :10.000
Knowing that we have several variables and that we are not going to use them all. We’ll start by filtering out the columns that aren’t needed.
smdata1 <- select(smdata, -`Time`, -`Tax 5%`, -`Invoice ID`, -`gross margin percentage`,-`cogs`)
str(smdata1)
## tibble [1,000 × 12] (S3: tbl_df/tbl/data.frame)
## $ Branch : chr [1:1000] "A" "C" "A" "A" ...
## $ City : chr [1:1000] "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
## $ Customer type: chr [1:1000] "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr [1:1000] "Female" "Female" "Male" "Male" ...
## $ Product line : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit price : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : num [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
## $ Total : num [1:1000] 549 80.2 340.5 489 634.4 ...
## $ Date : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Payment : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ gross income : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
smdata1 %>%
summary()
## Branch City Customer type Gender
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product line Unit price Quantity Total
## Length:1000 Min. :10.08 Min. : 1.00 Min. : 10.68
## Class :character 1st Qu.:32.88 1st Qu.: 3.00 1st Qu.: 124.42
## Mode :character Median :55.23 Median : 5.00 Median : 253.85
## Mean :55.67 Mean : 5.51 Mean : 322.97
## 3rd Qu.:77.94 3rd Qu.: 8.00 3rd Qu.: 471.35
## Max. :99.96 Max. :10.00 Max. :1042.65
## Date Payment gross income Rating
## Length:1000 Length:1000 Min. : 0.5085 Min. : 4.000
## Class :character Class :character 1st Qu.: 5.9249 1st Qu.: 5.500
## Mode :character Mode :character Median :12.0880 Median : 7.000
## Mean :15.3794 Mean : 6.973
## 3rd Qu.:22.4453 3rd Qu.: 8.500
## Max. :49.6500 Max. :10.000
mean(smdata$'Unit price', na.rm = TRUE)
## [1] 55.67213
smdata1$ProductLine <- format(smdata1$'Product line', format = "%Y")
The first plot globally presents the overview of the different types of products sale on the supermarket
plot1 <- ggplot(smdata1, aes(x = ProductLine)) +
labs(title = "Number of product line Purchases") +
geom_bar(mapping = aes(x= ProductLine, fill=ProductLine)) + scale_fill_brewer(palette = "Spectral") +
theme_minimal() +
xlab("Product Line") +
ylab("Count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
plot1
The second plot is going to present the different modes of payment using by the customers to see which one they use the most.
mode_of_payment <- ggplot(smdata1, aes(x = Payment)) +
labs(title = "Mode of Payment") +
geom_bar(mapping = aes(x= Payment, fill=Payment )) + scale_fill_brewer(palette = "CMRmap") +
theme_minimal() +
xlab("Payment Type") +
ylab("Count")
## Warning in pal_name(palette, type): Unknown palette CMRmap
mode_of_payment
We can see that “Ewallet” is the most used mode of payment followed by cash, and the least used mode of payment in the three stores is the “credit card”
smdata1$ProductLine <- format(smdata1$'Product line', format = "%Y")
# Which gender has purchase the most according to the different types of product line?
smdata1 %>%
filter(ProductLine != "") %>%
group_by(Gender,ProductLine) %>%
summarise(Sum=sum(Quantity))%>%
ggplot(aes(x=reorder(ProductLine,Sum),Sum, fill = Gender, label=Sum))+
geom_bar(stat="identity",position="dodge")+
xlab("Product Line")+
coord_flip()+
ggtitle("Purchases made according to Gender")
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
In general, there is not a big difference between purchases made by men and women. However, based on our plot it seems that there are more purchases made by women than by men in all categories except health and beauty.Which is a little surprising to me.
# Now, we may want to get a general overview of sales for each Branch A,B and C.
#So, we are going to calculate the average sales amount and the number of sales per branch.
smdata3=group_by(smdata1,Branch)
summarize(smdata3,avg_tot=mean(Total),Quantity=n())
## # A tibble: 3 × 3
## Branch avg_tot Quantity
## <chr> <dbl> <int>
## 1 A 312. 340
## 2 B 320. 332
## 3 C 337. 328
Branch C has both the highest total average amount and the least number of sales.
smdata1$Gross_income <- format(smdata1$'gross income', format = "%Y")
smdata1%>%
ggplot() + geom_boxplot(aes(y=Gross_income, group=Branch, fill=Branch)) +
scale_fill_manual(values=c(" Dark Blue","Orange"," Dark Green")) +
ggtitle("Comparison Between Branch and Gross income") +
coord_flip()
There is not much difference in gross income by branches at an average
level.Branch C is slightly higher income that A or B. As observed early
when we calculated the avg_tot, through branch A has slightly higher
sales than the other. C in Naypyitaw is the most profitable branch in
term of gross income.
For each sellers, customer satisfaction is essential because this is what ensures that, in the face of competition, each of the sellers can position themselves on the market as the best.
smdata1$Customer.type <- format(smdata1$'Customer type', format = "%Y")
# And one of the most effective ways to see if a customer is satisfied or not is to see how he rates the services or products he has purchased.
# In my plot I will therefore evaluate this according to the category of customers (as member or normal customer).
smdata <- smdata1
smdata %>%
select(Customer.type, Rating) %>%
group_by(Customer.type, Rating) %>%
summarise(count_rating=n()) %>%
ggplot(aes(x=Rating, y=count_rating, color=Customer.type)) +
geom_smooth(method="loess", se=FALSE) +
ggtitle("Line chart of ratings by type of customers")
## `summarise()` has grouped output by 'Customer.type'. You can override using the
## `.groups` argument.
## `geom_smooth()` using formula = 'y ~ x'
Based on our plot, it was observed that normal customers are more likely to give low ratings. On the other hand, members are more likely to give higher ratings. This could be because member consumers get better customer service and discounts. Then, therefore this increases their satisfaction.
In this step 6, we are going to show a overview of how much customers are likely to spend per shopping by creating an histogram.
#The gender column is passed to the color parameter to see the distribution for males and females separately.The bins set the number of bins
ggplot(smdata)+
geom_histogram(mapping=aes(x=Total, color=Gender),bins=15,
fill='white')+
labs(title = "Histogram of total sales amount by gender") +
xlab("Total") +
ylab("Count") +
theme_minimal()
We can see that the distribution of the total sales amount is highly similar for males and females.
library("lubridate")
## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
In this step we are going to show the total sales according to the days of the week. # Change Date Format
smdata4 <- smdata1 %>%
mutate(month=month(as.POSIXlt(Date, format="%m/%d/%Y")), day=day(as.POSIXlt(Date, format="%m/%d/%Y"))) #%>%
smdata4 <- smdata4[,-1]
smdata4$Date <- as.Date(smdata4$Date, "%m/%d/%y")
year(smdata4$Date) <- 2019
head(smdata4)
## # A tibble: 6 × 16
## City Custo…¹ Gender Produ…² Unit …³ Quant…⁴ Total Date Payment gross…⁵
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <date> <chr> <dbl>
## 1 Yangon Member Female Health… 74.7 7 549. 2019-01-05 Ewallet 26.1
## 2 Naypy… Normal Female Electr… 15.3 5 80.2 2019-03-08 Cash 3.82
## 3 Yangon Normal Male Home a… 46.3 7 341. 2019-03-03 Credit… 16.2
## 4 Yangon Member Male Health… 58.2 8 489. 2019-01-27 Ewallet 23.3
## 5 Yangon Normal Male Sports… 86.3 7 634. 2019-02-08 Ewallet 30.2
## 6 Naypy… Normal Male Electr… 85.4 7 628. 2019-03-25 Ewallet 29.9
## # … with 6 more variables: Rating <dbl>, ProductLine <chr>, Gross_income <chr>,
## # Customer.type <chr>, month <dbl>, day <int>, and abbreviated variable names
## # ¹`Customer type`, ²`Product line`, ³`Unit price`, ⁴Quantity,
## # ⁵`gross income`
# Add column weekday
smdata6 <- mutate(smdata4, Weekday = wday(Date))
smdata6$Weekday <- factor(smdata6$Weekday, levels=c(1,2,3,4,5,6,7),
labels=c("Sunday","Monday","Tuesday","Wednesday","Thursday",
"Friday","Saturday"))
head(smdata6)
## # A tibble: 6 × 17
## City Custo…¹ Gender Produ…² Unit …³ Quant…⁴ Total Date Payment gross…⁵
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <date> <chr> <dbl>
## 1 Yangon Member Female Health… 74.7 7 549. 2019-01-05 Ewallet 26.1
## 2 Naypy… Normal Female Electr… 15.3 5 80.2 2019-03-08 Cash 3.82
## 3 Yangon Normal Male Home a… 46.3 7 341. 2019-03-03 Credit… 16.2
## 4 Yangon Member Male Health… 58.2 8 489. 2019-01-27 Ewallet 23.3
## 5 Yangon Normal Male Sports… 86.3 7 634. 2019-02-08 Ewallet 30.2
## 6 Naypy… Normal Male Electr… 85.4 7 628. 2019-03-25 Ewallet 29.9
## # … with 7 more variables: Rating <dbl>, ProductLine <chr>, Gross_income <chr>,
## # Customer.type <chr>, month <dbl>, day <int>, Weekday <fct>, and abbreviated
## # variable names ¹`Customer type`, ²`Product line`, ³`Unit price`, ⁴Quantity,
## # ⁵`gross income`
ggplot(smdata6)+
geom_bar(mapping = aes(x=Weekday, color=City), fill='white',
position='dodge')+
ggtitle("Total Sales Per Week")+
xlab("Weekday")+
ylab("Total Sales Per Week")
This plot showed that the highest sales is on the Saturdays as most of the people would prefer to buy groceries during the weekend and when they are able to manage their time. On the contrary, the Mondays are likely to be the lowest in sales, probably because it is the beginning of the working week. Interestingly, we can see that Tuesday is the second highest sales in a week.
https://www.statista.com/topics/1563/supermarkets-in-the-us/#dossierKeyfigures
In order to deepen my knowledge, I extended my research to have information on the American market. Indeed, as everywhere else, supermarkets represent one of the most important channels in the distribution sector, both for consumers who tend to prefer them to other options, and for the value they generate, some big chains are also among the top retailers. In 2021, for example according to Statistica, supermarket sales in the United States exceeded US$765 billion in 2021, more than double the amount generated in previous years. The majority of sales exploded in the food and beverage sector with around 63,000 stores. In terms of employment size in supermarkets, most are also doing well, with some having less than five employees and others more than 500 employees. However, one of the negatives that I took away from this article is that e-commerce in the total turnover of supermarkets in the food and beverage sector, despite its growth, remains low. Supermarkets continue to operate and focus on physical stores to generate the bulk of their revenue.
Being a consumer and member in several stores, this visualization meant a lot to me because as I said above it business domain is a field of activity that interests me and in which I wish to integrate myself as a seller (even if I already have a small business online). Through this visualization, especially the plot that highlights preferences by gender, I was surprised to find that there was not a huge difference between the purchases made by men and those made by women. The most notable difference was that between “Health and Beauty” products in which men are the largest consumer. Women are often given the label of consuming more beauty products, although this is not always the case. According to our plot, men spend more than women on health and beauty products. The other part that caught my attention was about consumption according days of the week. Even though it was for me the most difficult plot to complete, I found it interesting to see that the purchases according to the days of the week could vary so much. It is clear that weekend days are the days with the highest consumption rates because at those times consumers have more free time and are better able to manage their schedules,and shop but I was surprised to see that certain days of the week can also be classified as medium consumption days such as Tuesdays, for example. As a small online business owner, this particular article gave me some ideas for strategies based on which days I might introduce new products or possibly do promotions with a 50/100 chance that a customer visits my website.Regarding my work, what I really wanted to include is the comparison of branches according to their sales, but for my personal research I will do it and if possible share it with you.