About the Data

This dataset was sourced from Aung Pyae on Kaggle, who obtained it from a supermarket chain in Myanmar. This data is filled with sales components in 3 supemarket branches and includes 17 variables.

supermarket <- read.csv("supermarket_sales.csv")

Attribute information

Invoice id: Computer generated sales slip invoice identification number

Branch: Branch of supercenter (3 branches are available identified by A, B and C).

City: Location of supercenters

Customer type: Type of customers, recorded by Members for customers using member card and Normal for without member card.

Gender: Gender type of customer

Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel

Unit price: Price of each product in $

Quantity: Number of products purchased by customer

Tax: 5% tax fee for customer buying

Total: Total price including tax

Date: Date of purchase (Record available from January 2019 to March 2019)

Time: Purchase time (10am to 9pm)

Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)

COGS: Cost of goods sold

Gross margin percentage: Gross margin percentage

Gross income: Gross income

Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)

Data Inspection

head(supermarket)

tail(supermarket)

dim(supermarket)

## [1] 1000   17

names(supermarket)

##  [1] "Invoice.ID"              "Branch"                 
##  [3] "City"                    "Customer.type"          
##  [5] "Gender"                  "Product.line"           
##  [7] "Unit.price"              "Quantity"               
##  [9] "Tax.5."                  "Total"                  
## [11] "Date"                    "Time"                   
## [13] "Payment"                 "cogs"                   
## [15] "gross.margin.percentage" "gross.income"           
## [17] "Rating"

From our inspection we can conclude : * supermarket data contain 1000 of rows and 17 of coloumns * Each of column name : “Invoice.ID”, “Branch”, “City”, “Customer.type”, “Gender”, “Product.line”, “Unit.price”,“Quantity”, “Tax.5”, “Total”, “Date”, “Time” , “Payment”, “cogs” , “gross.margin.percentage”, “gross.income”, “Rating”.

Data Cleansing

str(supermarket)

## 'data.frame':    1000 obs. of  17 variables:
##  $ Invoice.ID             : Factor w/ 1000 levels "101-17-6199",..: 815 143 654 19 340 734 316 265 703 727 ...
##  $ Branch                 : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
##  $ City                   : Factor w/ 3 levels "Mandalay","Naypyitaw",..: 3 2 3 3 3 2 3 2 3 1 ...
##  $ Customer.type          : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
##  $ Gender                 : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
##  $ Product.line           : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax.5.                 : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...
##  $ Date                   : Factor w/ 89 levels "1/1/2019","1/10/2019",..: 27 88 82 20 58 77 49 48 2 44 ...
##  $ Time                   : Factor w/ 506 levels "10:00","10:01",..: 147 24 156 486 30 394 215 78 342 160 ...
##  $ Payment                : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...

The feature invoice id is of no use . So let us drop the feature. We also need to check if there are any missing values in the data set. Additionally lets take a look at the summary statistics of the same. Additionally, lets convert the Date to a standardized format.

supermarket_clean <- supermarket[,-1]
supermarket_clean$Date <- as.Date(supermarket_clean$Date, "%m/%d/%y")
year(supermarket_clean$Date) <- 2019
sapply(supermarket_clean, function(x) sum(is.na(x)))

##                  Branch                    City           Customer.type 
##                       0                       0                       0 
##                  Gender            Product.line              Unit.price 
##                       0                       0                       0 
##                Quantity                  Tax.5.                   Total 
##                       0                       0                       0 
##                    Date                    Time                 Payment 
##                       0                       0                       0 
##                    cogs gross.margin.percentage            gross.income 
##                       0                       0                       0 
##                  Rating 
##                       0

summary(supermarket_clean)

##  Branch         City     Customer.type    Gender   
##  A:340   Mandalay :332   Member:501    Female:501  
##  B:332   Naypyitaw:328   Normal:499    Male  :499  
##  C:328   Yangon   :340                             
##                                                    
##                                                    
##                                                    
##                                                    
##                  Product.line   Unit.price       Quantity         Tax.5.       
##  Electronic accessories:170   Min.   :10.08   Min.   : 1.00   Min.   : 0.5085  
##  Fashion accessories   :178   1st Qu.:32.88   1st Qu.: 3.00   1st Qu.: 5.9249  
##  Food and beverages    :174   Median :55.23   Median : 5.00   Median :12.0880  
##  Health and beauty     :152   Mean   :55.67   Mean   : 5.51   Mean   :15.3794  
##  Home and lifestyle    :160   3rd Qu.:77.94   3rd Qu.: 8.00   3rd Qu.:22.4453  
##  Sports and travel     :166   Max.   :99.96   Max.   :10.00   Max.   :49.6500  
##                                                                                
##      Total              Date                 Time            Payment   
##  Min.   :  10.68   Min.   :2019-01-01   14:42  :  7   Cash       :344  
##  1st Qu.: 124.42   1st Qu.:2019-01-24   19:48  :  7   Credit card:311  
##  Median : 253.85   Median :2019-02-13   17:38  :  6   Ewallet    :345  
##  Mean   : 322.97   Mean   :2019-02-14   10:11  :  5                    
##  3rd Qu.: 471.35   3rd Qu.:2019-03-08   11:40  :  5                    
##  Max.   :1042.65   Max.   :2019-03-30   11:51  :  5                    
##                                         (Other):965                    
##       cogs        gross.margin.percentage  gross.income         Rating      
##  Min.   : 10.17   Min.   :4.762           Min.   : 0.5085   Min.   : 4.000  
##  1st Qu.:118.50   1st Qu.:4.762           1st Qu.: 5.9249   1st Qu.: 5.500  
##  Median :241.76   Median :4.762           Median :12.0880   Median : 7.000  
##  Mean   :307.59   Mean   :4.762           Mean   :15.3794   Mean   : 6.973  
##  3rd Qu.:448.90   3rd Qu.:4.762           3rd Qu.:22.4453   3rd Qu.: 8.500  
##  Max.   :993.00   Max.   :4.762           Max.   :49.6500   Max.   :10.000  
##

Summary

Dari sumarry diatas dapat terlihat bahwa jumlah transaksi terbesar ada pada cabang A
Kota yang memiliki jumlah transaksi terbesar berada di kota Yangon
Customers with member categories more than normal categories
Dari ke tiga cabang supermarket pelanggan wanita lebih banyak dari pria
Selama tiga bulan, tiga produk yang memiliki jumlah penjualan tertinggi ada pada kategori Fashion accessories, Food and beverages dan kategori Electronic accessories
Harga stuan per unit, dari semua kategori produk di ketiga cabang berkisar di 55.67, ter rendah di 10.08 dan tertinggi di 99.96
Dari ke tiga cabang pesanan kuantitas maksimum adalah 10 dalam 3 bulan tetapi rata-rata di 5.51

supermarket_clean <- supermarket_clean %>% 
  mutate(revenue = Unit.price*Quantity) %>% 
  group_by(City)
supermarket_clean

Exploratory Data & Visualization

Min - Max Sales Product Catagory

plot_supermarket <- supermarket_clean %>% 
                ggplot(aes(x = City,
                           y = Total,
                           fill = Product.line)) +
               facet_wrap(~Customer.type) + 
               theme_classic() +
               geom_col(position = "dodge")+
               scale_y_continuous(limits = c(0, 1200))
  
ggplotly(plot_supermarket)

## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Dari bar plot diatas dapat terlihat pelanggan ber tipe Member dapat membeli satu produk tertinggi di kategori Fashion accsories pada kota Naypitaw dan terendah di kategori Health and beauty pada kota Yangon, dan pelanggan ber tipe Normal dapat membeli satu produk tertinggi di kategori Fashion accessories pada kota Yangon dan ter rendah di kategori Food and beverages pada kota Mandalay.

Total Sales in all cities

supermarket_total <- supermarket_clean %>% 
  select(City,Total) %>% 
  group_by(City) %>% 
  summarise(Total = sum(Total)) %>% 
  arrange(-Total)

## `summarise()` ungrouping output (override with `.groups` argument)

supermarket_total

plot_supermarket_total <- supermarket_total %>% 
                ggplot(aes(x = City,
                           y = Total,
                           fill = City)) +
               theme_pander() +
               geom_col(position = "dodge") +
               scale_y_continuous(limits = c(0, 120000))
  
ggplotly(plot_supermarket_total)

liatlah plot diatas, plot tersebut menunjukan kota Naypitaw lah yang memberikan total penjualan ter tinggi dan ter rendah pada kota Mandalay.

Relationship between Branch and Ratings

bxpBR <- supermarket_clean %>%  
          ggplot(data = supermarket_clean, mapping = aes(x = Branch, y = Rating)) + 
          geom_boxplot(notch = TRUE, mapping = aes(fill = Branch)) +
          theme_classic() + ggtitle("Box Plot Showing Relationship between Branch and Ratings") +
          xlab("Branch")+ ylab("Rating") +
          geom_hline(mapping = aes(yintercept = 7.1), linetype = "dashed") + 
          geom_hline(mapping = aes(yintercept = 6.7), linetype = "dashed") +
          geom_text(mapping = aes(x = "A", y = 7.5, label = "Mandalay")) +
          geom_text(mapping = aes(x = "B", y = 7.5, label = "Naypyitaw")) + 
          geom_text(mapping = aes(x = "C", y = 7.5, label = "Yangon")) +
          scale_y_continuous(breaks = c(4,5.3,5.6,6.7,7.1,8.2,8.5,10))
ggplotly(bxpBR)

Boxplot diatas menunjukan bahwa distribusi rating pada kota Mandalay dan Yangon memiliki distribusi yang sama, tersebar di Q1:5.6 - Q3:8.5 dan memiliki nilai tengah di 7.1, sedangkan kota Naypyitaw jauh lebih rendah di banding 2 kota yang lain, persebaran nya di Q1:5.3 - Q3:8.2 dan memiliki nilai tengah di 6.7.

Distribution of Customers in Each Branch Based on Gender

dist_cust <- ggplot(data = supermarket_clean, mapping = aes(x = Branch)) + 
             geom_bar(mapping = aes(fill = Customer.type)) + theme_linedraw() + 
             ggtitle("Distribution of Customers in Each Branch Based on Gender") +
             xlab("Branch") + ylab("Number of Customers") + facet_wrap(supermarket_clean$Gender) + 
             scale_y_continuous(breaks = c(0, 25, 50, 75, 100, 125, 150, 175)) 
ggplotly(dist_cust)

Dari plot diatas kita dapat mengetahui jumlah persebaran jenis pelanggan berdasarkan jenis kelamin, dapat terlihat dari ke 3 Branch bahwa Wanita lebih banyak menjadi pelanggan Member ketimbang Pria.

Distribution of Customers in Each Branch Based on Payment

 ggplot(data = supermarket_clean, aes(x = Quantity,
                             y = Customer.type,
                             fill = Payment
                             )) +
                  ggtitle("A")+
                  theme_pander()+
                  geom_col(position = "stack") +
  scale_x_continuous(limits = c(0, 2500))

## Warning: Removed 94 rows containing missing values (geom_col).

Dapat terlihat dari plot diatas bahwa tipe pelanggan Member, cinderung membeli produk menggunakan Credit card sedangkan tipe pelanggan Normal lebih cinderung menggunakan Ewallet dan Cash.

cust_type_payment <- supermarket_clean %>% 
                     ggplot(data = supermarket_clean, mapping = aes(x = Branch)) + 
                     geom_bar(mapping = aes(fill = Customer.type)) + 
                     theme_linedraw() + ggtitle("Distribution of Customers in Each Branch Based on Payment Mode") + 
                     xlab("Branch") + ylab("Number of Customers") 
                     cust_type_payment + facet_wrap(supermarket_clean$Payment) +
                     scale_y_continuous(breaks = c(0, 25, 50, 75, 100, 125, 150, 175)) + 
                     labs(caption = "Cash(Members) - 56/53/59 \n CreditCard(Members) - 49/63/60 \n Ewallet(Members) - 62/49/50")

ggplotly(cust_type_payment)

plot diatas menunjukan jumlah distribusi tipe pelanggan dari setiap Branch.

Total Sales

#create new theme
theme_a <- theme(legend.key = element_rect(fill="black"),
           legend.background = element_rect(color="white", fill="#263238"),
           plot.subtitle = element_text(size=6, color="white"),
           panel.background = element_rect(fill="#dddddd"),
           panel.border = element_rect(fill=NA),
           panel.grid.minor.x = element_blank(),
           panel.grid.major.x = element_blank(),
           panel.grid.major.y = element_line(color="darkgrey", linetype=2),
           panel.grid.minor.y = element_blank(),
           plot.background = element_rect(fill="#263238"),
           text = element_text(color="white"),
           axis.text = element_text(color="white")
           )

Total Sales Per Month Based on Branches

Jan_sales <- supermarket_clean %>% mutate(month = month(Date)) %>% filter(month == 1)
salesplot_per_month1 <- ggplot(data = Jan_sales, mapping = aes(x = Branch, y = Total)) +
      geom_boxplot(notch = TRUE, mapping = aes(fill = Branch), outlier.color = "red", show.legend = FALSE) +
      theme_linedraw() + ggtitle("January Sales") + 
      xlab("Branches") + ylab("Total Sales")

Feb_sales <- supermarket_clean %>% mutate(month = month(Date)) %>% filter(month == 2)
salesplot_per_month2 <- ggplot(data = Feb_sales, mapping = aes(x = Branch, y = Total)) + 
      geom_boxplot(notch = TRUE, mapping = aes(fill = Branch), outlier.color = "red", show.legend = FALSE) +
      theme_linedraw() + ggtitle("February Sales") +
      xlab("Branches") + ylab("Total Sales")

Mar_sales <- supermarket_clean %>% mutate(month = month(Date)) %>% filter(month == 3)
salesplot_per_month3 <- ggplot(data = Mar_sales, mapping = aes(x = Branch, y = Total)) + 
      geom_boxplot(notch = TRUE, mapping = aes(fill = Branch), outlier.color = "red") +
      theme_linedraw() + ggtitle("March Sales") + 
      xlab("Branches") + ylab("Total Sales")

grid.arrange(salesplot_per_month1,salesplot_per_month2,salesplot_per_month3)

plot diatas menjelaskan sebaran jumlah penjualan pada bulan januari sampei maret di tiga cabang berbeda, dapat dilihat bahwa penjualan Branch C (Naypyitaw) cinderung stabil dan mengalami sedikit kenaikan pada bulan maret, berbeda dengan Branch A (Yangon) yang mengalami penurunan tiap bulan nya begitu pula dengan Branch B (Mandalay) .

Total Sales Per Month Based on Product Line

     bxp1 <- ggplot(data = Jan_sales, mapping = aes(x = Product.line, y = Total)) + 
             geom_boxplot(mapping = aes(fill = Product.line), outlier.color = "red", show.legend = FALSE) +
             scale_x_discrete(labels = c("Elc", "Fsh", "Fod", "Hel", "Hme", "Spr")) +
             theme_linedraw() + ggtitle("January Sales Based on Product Lines") +
             xlab("Product Lines") + ylab("Total Sales")

     bxp2 <- ggplot(data = Feb_sales, mapping = aes(x = Product.line, y = Total)) + 
             geom_boxplot(mapping = aes(fill = Product.line), outlier.color = "red", show.legend = FALSE) +
             scale_x_discrete(labels = c("Elc", "Fsh", "Fod", "Hel", "Hme", "Spr")) +
             theme_linedraw() + ggtitle("February Sales Based on Product Lines") +
             xlab("Product Lines") + ylab("Total Sales")

     bxp3 <- ggplot(data = Mar_sales, mapping = aes(x = Product.line, y = Total))+ 
             geom_boxplot(mapping = aes(fill = Product.line), outlier.color = "red") +
             scale_x_discrete(labels = c("Elc", "Fsh", "Fod", "Hel", "Hme", "Spr")) +
             theme_linedraw() + ggtitle("March Sales Based on Product Lines") + 
             xlab("Product Lines") + ylab("Total Sales")

grid.arrange(bxp1,bxp2,bxp3)

plot diatas menunjukan Jumlah penjualan tiap produk pada tiap bulan, dapat disimpulkan bahwa penjualan tertinggi pada kategori produk Food and beverages dikarnakan kemungkinan pada bulan maret terdapat banyak outlier Total di sekitar 750 - 900, lalu disusul kedua di kategori Sports and travel dilihat dari distribusi box nya konstan di 2 bulan pertama dan naik pada bulan terakhir.

supermarket_total_product <- supermarket_clean %>% 
  select(Product.line, Total) %>% 
  group_by(Product.line) %>% 
  summarise(Total_Sales = sum(Total)) %>% 
  arrange(-Total_Sales)

## Adding missing grouping variables: `City`

## `summarise()` ungrouping output (override with `.groups` argument)

supermarket_total_product

Total Quantity per Product.line

supermarket_total_Q <- supermarket_clean %>% 
  select(Product.line, Quantity) %>% 
  group_by(Product.line) %>% 
  summarise(Total_Quantity = sum(Quantity)) %>% 
  arrange(-Total_Quantity)

## Adding missing grouping variables: `City`

## `summarise()` ungrouping output (override with `.groups` argument)

supermarket_total_Q

Tabel diatas berisikan jumlah keseluruhan dari tiap kategori product yang terjual di tiga cabang.

plot_supermarket_totalQ <- supermarket_total_Q %>% 
                ggplot(aes(x = Product.line,
                           y = Total_Quantity,
                           fill = Product.line)) +
               theme_a +
               geom_col(position = "dodge")
              # scale_y_continuous(limits = c(0, 1200))
  
ggplotly(plot_supermarket_totalQ)

Terlihat diatas bahwa dari ke tiga Branch produk yang paling banyak di jual merupakan produk berkategori Electonic accessories, dapat disimpulkan bahwa tidak berarti bahwa Quantity suatu produk banyak terjual akan mengasilkan Total sales yang besar juga.

supermarket_Naypyitaw <- supermarket_clean[supermarket_clean$City == "Naypyitaw",]
supermarket_Naypyitaw

Total Quantity Based on Cities

supermarket_total_product_Naypyitaw <- supermarket_Naypyitaw %>% 
  select(Product.line, Quantity) %>% 
  group_by(Product.line) %>% 
  summarise(Total_Quantity = sum(Quantity)) %>% 
  arrange(-Total_Quantity)

## Adding missing grouping variables: `City`

## `summarise()` ungrouping output (override with `.groups` argument)

supermarket_total_product_Naypyitaw

plot_supermarket_totalQC <- supermarket_total_product_Naypyitaw %>% 
                ggplot(aes(x = Product.line,
                           y = Total_Quantity,
                           fill = Product.line)) +
              theme_a +
               geom_col(position = "dodge")
              # scale_y_continuous(limits = c(0, 1200))
  
ggplotly(plot_supermarket_totalQC)

supermarket_Yangon<- supermarket_clean[supermarket_clean$City == "Yangon",]
supermarket_Yangon

supermarket_total_product_Yangon <- supermarket_Yangon %>% 
  select(Product.line, Quantity) %>% 
  group_by(Product.line) %>% 
  summarise(Total_Quantity = sum(Quantity)) %>% 
  arrange(-Total_Quantity)

## Adding missing grouping variables: `City`

## `summarise()` ungrouping output (override with `.groups` argument)

supermarket_total_product_Yangon

plot_supermarket_totalQA <- supermarket_total_product_Yangon %>% 
                ggplot(aes(x = Product.line,
                           y = Total_Quantity,
                           fill = Product.line)) +
               theme_a +
               geom_col(position = "dodge")
              # scale_y_continuous(limits = c(0, 1200))
  
ggplotly(plot_supermarket_totalQA)

supermarket_Mandalay <- supermarket_clean[supermarket_clean$City == "Mandalay",]
supermarket_Mandalay

supermarket_total_product_Mandalay <- supermarket_Mandalay %>% 
  select(Product.line, Quantity) %>% 
  group_by(Product.line) %>% 
  summarise(Total_Quantity = sum(Quantity)) %>% 
  arrange(-Total_Quantity)

## Adding missing grouping variables: `City`

## `summarise()` ungrouping output (override with `.groups` argument)

supermarket_total_product_Mandalay

plot_supermarket_totalQB <- supermarket_total_product_Mandalay %>% 
                ggplot(aes(x = Product.line,
                           y = Total_Quantity,
                           fill = Product.line)) +
               theme_a +
               geom_col(position = "dodge")
              # scale_y_continuous(limits = c(0, 1200))
  
ggplotly(plot_supermarket_totalQB)

Relationship for the Total Revenue per day

total_revenue_per_day <- data.frame(xtabs(formula=revenue~Date, data = supermarket_clean))
total_revenue_per_day$Date <- as.Date(total_revenue_per_day$Date)

plotR <- total_revenue_per_day %>% 
       ggplot(data = total_revenue_per_day, mapping = aes(x = Date, y = Freq)) + 
       geom_line() +
       geom_point() +
       theme_pander() + 
       ggtitle("Time Series Relationship for the Total Revenue per day") + 
       xlab("Date")+ ylab("Total revenue Per Day")
ggplotly(plotR)

Dari terlihat dari plot diatas bahwa setiap tanggal 4, 18 dan 13 selalu mengalami penurunan pendapatan dan terparah di tanggal 13 bulan Februari.

Relationship for the Total Revenue per day in all Branch

A <- supermarket_clean %>% filter(Branch == "A")
     revenue_A <- data.frame(xtabs(formula = revenue~Date, data = A))
revenue_A$Date <- as.Date(revenue_A$Date)
B <- supermarket_clean %>% filter(Branch == "B")
     revenue_B <- data.frame(xtabs(formula = revenue~Date, data = B))
revenue_B$Date <- as.Date(revenue_B$Date)
C <- supermarket_clean %>% filter(Branch == "C")
     revenue_C <- data.frame(xtabs(formula = revenue~Date, data = C))
revenue_C$Date <- as.Date(revenue_C$Date)

plotA <- ggplot(data = revenue_A, mapping = aes(x = Date, y = Freq)) +
         geom_line() + theme_linedraw() + 
         ggtitle("Time Series Relationship for the Total Revenue per day in Branch A") + 
        xlab("Date")+ ylab("Total Revenue Per Day")
plotB <- ggplot(data = revenue_B, mapping = aes(x = Date, y = Freq)) +
         geom_line() + theme_linedraw() + 
         ggtitle("Time Series Relationship for the Total Revenue per day in Branch B") +
         xlab("Date")+ ylab("Total Revenue Per Day")
plotC <- ggplot(data = revenue_C, mapping = aes(x = Date, y = Freq)) +
         geom_line() + theme_linedraw() + 
         ggtitle("Time Series Relationship for the Total Revenue per day in Branch C") +
         xlab("Date")+ ylab("Total Revenue Per Day")

grid.arrange(plotA, plotB, plotC)

Data Visualization for Insight Analysis

Sandy Putra Utama

2020-11-09