Supermarket Sales Analysis

Purpose of Case Study

The company I am working with in this data set is considered a supermarket that carries a large variety of items. I will be determining from this data set with the sales and marketing team whether an overall better shopping experience leads to higher revenue per store location, which departments across the three separate branches are seeing the highest sales volume in order to access expanding product lines, and finally accessing staffing needs based off of average number of transactions per day.

Setting up the Environment for Analysis

We will first want to ensure the right packages are installed by adding tidyverse to this program. We will install and load ‘tidyverse’, which contains GGplot2 for visualizations.

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Attaching Files from Excel

Next, I will be uploading the original data set and a reformatted data set, which showed days of the week. The reformatted data analysis will be used to show days of the week to determine the average amount of sales transactions per day to determine staffing needs.

supermarket_sales <- read_csv("Supermarket_Sales.csv")

## Rows: 1000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): Invoice ID, Branch, City, Customer type, Gender, Product line, Dat...
## dbl  (8): Unit price, Quantity, Tax 5%, Total, cogs, gross margin percentage...
## time (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Supermarket_Sales_Reformat <- read_csv("Supermarket_Sales_Reformat.csv")

## Rows: 1000 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Invoice ID, Branch, City, Customer type, Gender, Product line, Day...
## dbl  (9): Unit price, Quantity, Tax 5%, Total, Year, cogs, gross margin perc...
## time (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(supermarket_sales)

## Rows: 1,000
## Columns: 17
## $ `Invoice ID`              <chr> "750-67-8428", "226-31-3081", "631-41-3108",…
## $ Branch                    <chr> "A", "C", "A", "A", "A", "C", "A", "C", "A",…
## $ City                      <chr> "Yangon", "Naypyitaw", "Yangon", "Yangon", "…
## $ `Customer type`           <chr> "Member", "Normal", "Normal", "Member", "Nor…
## $ Gender                    <chr> "Female", "Female", "Male", "Male", "Male", …
## $ `Product line`            <chr> "Health and beauty", "Electronic accessories…
## $ `Unit price`              <dbl> 74.69, 15.28, 46.33, 58.22, 86.31, 85.39, 68…
## $ Quantity                  <dbl> 7, 5, 7, 8, 7, 7, 6, 10, 2, 3, 4, 4, 5, 10, …
## $ `Tax 5%`                  <dbl> 26.1415, 3.8200, 16.2155, 23.2880, 30.2085, …
## $ Total                     <dbl> 548.9715, 80.2200, 340.5255, 489.0480, 634.3…
## $ Date                      <chr> "1/5/2019", "3/8/2019", "3/3/2019", "1/27/20…
## $ Time                      <time> 13:08:00, 10:29:00, 13:23:00, 20:33:00, 10:…
## $ Payment                   <chr> "Ewallet", "Cash", "Credit card", "Ewallet",…
## $ cogs                      <dbl> 522.83, 76.40, 324.31, 465.76, 604.17, 597.7…
## $ `gross margin percentage` <dbl> 4.761905, 4.761905, 4.761905, 4.761905, 4.76…
## $ `gross income`            <dbl> 26.1415, 3.8200, 16.2155, 23.2880, 30.2085, …
## $ Rating                    <dbl> 9.1, 9.6, 7.4, 8.4, 5.3, 4.1, 5.8, 8.0, 7.2,…

glimpse(Supermarket_Sales_Reformat)

## Rows: 1,000
## Columns: 19
## $ `Invoice ID`              <chr> "361-85-2571", "291-55-6563", "115-38-7388",…
## $ Branch                    <chr> "A", "A", "C", "A", "B", "C", "B", "A", "B",…
## $ City                      <chr> "Yangon", "Yangon", "Naypyitaw", "Yangon", "…
## $ `Customer type`           <chr> "Normal", "Member", "Member", "Normal", "Mem…
## $ Gender                    <chr> "Female", "Female", "Female", "Female", "Mal…
## $ `Product line`            <chr> "Sports and travel", "Home and lifestyle", "…
## $ `Unit price`              <dbl> 89.48, 34.42, 10.18, 40.23, 97.22, 70.21, 93…
## $ Quantity                  <dbl> 5, 6, 8, 7, 9, 6, 2, 5, 8, 8, 3, 10, 7, 5, 4…
## $ `Tax 5%`                  <dbl> 22.3700, 10.3260, 4.0720, 14.0805, 43.7490, …
## $ Total                     <dbl> 469.7700, 216.8460, 85.5120, 295.6905, 918.7…
## $ `Day of the Week`         <chr> "Saturday", "Saturday", "Saturday", "Saturda…
## $ Date                      <chr> "March 30", "March 30", "March 30", "March 3…
## $ Year                      <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20…
## $ Time                      <time> 10:18:00, 12:45:00, 12:51:00, 13:22:00, 14:…
## $ Payment                   <chr> "Cash", "Ewallet", "Credit card", "Cash", "E…
## $ cogs                      <dbl> 447.40, 206.52, 81.44, 281.61, 874.98, 421.2…
## $ `gross margin percentage` <dbl> 4.761905, 4.761905, 4.761905, 4.761905, 4.76…
## $ `gross income`            <dbl> 22.3700, 10.3260, 4.0720, 14.0805, 43.7490, …
## $ Rating                    <dbl> 7.4, 7.5, 9.5, 9.6, 6.0, 7.4, 5.5, 6.6, 4.6,…

Correlation Between Number of Transactions and Customer Satisfaction Rating

First, I want to determine from the data set if a higher customer satisfaction rating resulted in more sales per branch. I ran a SQL query to determine if that was the case. Then, I exported the results into a Excel spreadsheet and imported it into R for a visualization.

Correlation_Between_Number_of_Transactions_and_Customer_Satisfaction_Rating <- read_csv("/cloud/lib/SuperMarket Sales Analysis Project/Correlation Between Number of Transactions and Customer Satisfaction Rating.csv")

## Rows: 3 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Branch_City
## dbl (2): number_of_transactions, avg_customer_rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I want to simplify this by renaming the spreadsheet to assign it to a variable.

Number_of_Transactions_vs_Customer_Rating <- Correlation_Between_Number_of_Transactions_and_Customer_Satisfaction_Rating

I will use a scattered chart to show the correlation with those two variables between the three branches.

 ggplot(data = Number_of_Transactions_vs_Customer_Rating, aes(x = avg_customer_rating, y = number_of_transactions, color = Branch_City)) + geom_point(stat = "identity") + labs(title="Correlation Between Number of Transactions and Customer Rating",x ="Average Customer Rating", y = "Number of Transactions", color = "Branch, City") + geom_text(aes(label = number_of_transactions), vjust = -1) + xlim(6.3,7.1) + ylim(0,500)

There was a small variance in the number of transactions as well as the customer rating for all three branches. The data that was provided only contained three months of sales transactions. More data going out to a years worth of transaction logs will need to be collected for a more accurate analysis.

Sales Volume by Department

I’ll be pulling the sales revenue along with number of transactions for every department across the three branches to determine the top three departments leading in those categories. I will be using a bar chart for this analysis.

First, I’ll want to import my spreadsheet pulled from SQL.

Sales_Revenue_by_Department <- read_csv("/cloud/lib/SuperMarket Sales Analysis Project/Sales Revenue by Department.csv")

## Rows: 6 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Product_line
## dbl (3): Number_of_Transactions, Number_of_Items_Sold, Total_Revenue_By_Depa...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Next, I’ll use ggplot2 for analysis.

ggplot(data = Sales_Revenue_by_Department, aes(x = Product_line, y = Total_Revenue_By_Department, fill = Number_of_Items_Sold)) + geom_bar(stat = "identity") +  theme(axis.text.x = element_text(angle = 90)) + labs(title = "Sales Revenue by Department", x = "Product Line", y = "Total Revenue by Department", fill = "Number of Items Sold") + geom_text(aes(label = Total_Revenue_By_Department), vjust = 0)

I’ve determined the three top departments are food and beverages, sports and travel, and electronic accessories. Health and beauty products fell sustainably behind in revenue compared to other categories.

Average Amount of Transactions per Day per Branch

Lastly, I will determine the average number of transactions per day per branch to access staffing needs. I will use the data set I pulled summarized in SQL for graphical representation.

Average_Amount_of_Transactions_per_Day_per_Branch <- read_csv("Average Amount of Transactions per Day per Branch.csv")

## Rows: 21 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Branch_City, Day_of_the_Week
## dbl (1): number_of_transactions_per_day
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(Average_Amount_of_Transactions_per_Day_per_Branch)

##  Branch_City        Day_of_the_Week    number_of_transactions_per_day
##  Length:21          Length:21          Min.   :3.000                 
##  Class :character   Class :character   1st Qu.:3.000                 
##  Mode  :character   Mode  :character   Median :4.000                 
##                                        Mean   :3.762                 
##                                        3rd Qu.:4.000                 
##                                        Max.   :5.000

ggplot(Average_Amount_of_Transactions_per_Day_per_Branch, aes(x = Day_of_the_Week, y = number_of_transactions_per_day, fill = Branch_City)) + geom_bar(stat = 'identity') + scale_x_discrete(limits = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")) + theme(axis.text.x = element_text(angle = 90)) + labs(title = "Average Amount of Transactions per Day per Branch", x = "Day of the Week" , y = "Number of Transactions per Day", fill = "Branch,City") + facet_wrap(~Branch_City)

The summary pulled the mean that was 3.7 transactions per day, which can be rounded to four transactions a day to determine staffing needs in every branch.

Recommendations

First, a years worth of sales transactions will need to be collected for the customer satisfaction score. This will give a better reflection on how it correlates with number of sales. I was only able to pull three months of sales log where the variance was insignificant between all three branches.

Second, the top three departments leading in sales revenue as well as number of transactions are food and beverages, sports and travel, and electronic accessories. I recommend expanding the product line in all three of these departments for every single branch. Health and beauty fell significantly behind in revenue compared to the other departments. I recommend cutting back the product line in that department for every branch.

Lastly, the transactions per day fluctuated each day in every branch. The mean showed to be 3.7, which will be rounded up to four. The store hours each day are from 10 AM - 9 PM. I recommend having one cashier at each branch per day with a rotating schedule throughout the week.

Citation

[Aunge Pyae]. ([March 2019]). [Supermarket sales], [Version 3]

[Link to Original Data Set] (https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales)