Group Members

Student ID Student Name
17217922 Farah Husna Binti Ramli (Leader)
S2001169 Nabilah Ahmad Kamal
17120332 Nor Azyra Binti Omar

Introduction

Definition: Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.

Objective: To test a hypothesis or check assumptions related to different types of problems for business analytics using the supermarket sales data.

Problems/Questions:
1. Descriptive - What is the top payment method based on the product lines?
2. Prediction - When is the best time to offer discounts?
3. Correlation - Is there a relationship between total sales on weekdays vs weekend?


Data Understanding

About the raw data

Dataset Details

Title: The Supermarket Sales Data

Year: 2019

Source: Kaggle: https://www.kaggle.com/aungpyaeap/supermarket-sales

About: The growth of supermarkets in most populated cities are increasing and market competitions are also high. This dataset is one of the historical sales of supermarket company which has recorded in 3 different branches for 3 months data.

Attributes:

  • Tax: 5% tax fee for customer buying
  • Total: Total price including tax
  • Date: Date of purchase (Record available from January 2019 to March 2019)
  • Time: Purchase time (10am to 9pm)
  • Payment: Payment used by customer for purchase (Cash, Credit card and Ewallet)
  • COGS: Cost of goods sold
  • Gross margin percentage: Gross margin percentage
  • Gross income: Gross income
  • Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)

Data Overview

#Load the dataset
supermarket_sales <- read.csv("supermarket_sales.csv")


Check the dimension of the database.

suppressWarnings(library(dplyr))
dim(supermarket_sales)
## [1] 1000   17


Preview some of the data to get an overview.

head(supermarket_sales)


Use descriptive analysis to understand the data.

str(supermarket_sales)
## 'data.frame':    1000 obs. of  17 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "A" "C" "A" "A" ...
##  $ City                   : chr  "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
##  $ Customer.type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax.5.                 : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
#to show min & max unit price for each of product line
supermarket_sales %>%
  group_by(supermarket_sales$Product.line) %>%
  summarise(
    total_inv = n(),
    min_price = min(Unit.price),
    max_price = max(Unit.price),
    avg_price = mean(Unit.price),
  )

Data Preprocessing

Is a significant step to process and prepare the data for analysis.

Data Cleaning

1- Confirm the class of the loaded dataset is a data frame.

class(supermarket_sales)
## [1] "data.frame"

2- Check for duplication sales transactions based on Invoice_ID.

sum(duplicated(supermarket_sales[,1]))
## [1] 0

3- Check for missing values across all columns in the data frame.

sapply(supermarket_sales,function(x) table(as.character(x) =="")["TRUE"])
##              Invoice.ID.NA                  Branch.NA 
##                         NA                         NA 
##                    City.NA           Customer.type.NA 
##                         NA                         NA 
##                  Gender.NA            Product.line.NA 
##                         NA                         NA 
##              Unit.price.NA                Quantity.NA 
##                         NA                         NA 
##                  Tax.5..NA                   Total.NA 
##                         NA                         NA 
##                    Date.NA                    Time.NA 
##                         NA                         NA 
##                 Payment.NA                    cogs.NA 
##                         NA                         NA 
## gross.margin.percentage.NA            gross.income.NA 
##                         NA                         NA 
##                  Rating.NA 
##                         NA
sapply(supermarket_sales,function(x) sum(is.na(x)))
##              Invoice.ID                  Branch                    City 
##                       0                       0                       0 
##           Customer.type                  Gender            Product.line 
##                       0                       0                       0 
##              Unit.price                Quantity                  Tax.5. 
##                       0                       0                       0 
##                   Total                    Date                    Time 
##                       0                       0                       0 
##                 Payment                    cogs gross.margin.percentage 
##                       0                       0                       0 
##            gross.income                  Rating 
##                       0                       0

4- Convert date from character to datetime.

typeof(supermarket_sales$Date) #check type of object for date
## [1] "character"
supermarket_sales$Date<-as.Date(supermarket_sales$Date, format = "%m/%d/%Y")

5- Remove irrelevant information- only keep columns significant to sales analysis for the business.

supermarket_sales<-subset(supermarket_sales, 
                          select=c("Branch","City","Customer.type","Gender",
                                   "Product.line","Quantity","Total","Date","Time","Payment"))

Data Transformation

1- Create new columns for month and day.

supermarket_sales<-transform(supermarket_sales,Months= months(supermarket_sales$Date))
supermarket_sales<-transform(supermarket_sales,Days= weekdays(supermarket_sales$Date))

2- Re-arrange the columns & assigned it to a new processed dataset.

SS_df_cleaned<-supermarket_sales[,c(1:8,11:12,9:10)]

3- Check the dimension of the transformed database.

dim(SS_df_cleaned) # by rows & columns
## [1] 1000   12

4- Get the details of the transformed database to confirm the changes done.

library(funModeling)  #for analysis
status(SS_df_cleaned)

Data Exploratory

1- Load required library for analysis and visualization.

library(ggplot2)      
library(scales)       
library(RColorBrewer) 

2- Analyzing categorical variables:

freq(SS_df_cleaned$Branch)
freq(SS_df_cleaned$City)
freq(SS_df_cleaned$Customer.type)
freq(SS_df_cleaned$Product.line)
freq(SS_df_cleaned$Payment)

3- Analyzing numerical variables:

profiling_num(SS_df_cleaned)
plot_num(SS_df_cleaned)

4- Analyzing categorical & numerical variables:

#Analysis by Customer type-Normal
SS_df_analysed<-subset(SS_df_cleaned,SS_df_cleaned$Customer.type=="Normal")
normal<-ggplot(data = SS_df_analysed, aes(x=Product.line,y=Total, color=Product.line)) + 
  geom_boxplot()+
  scale_color_brewer(palette="Dark2") + 
  geom_jitter(shape=16, position=position_jitter(0.2))+
  theme(axis.text.x = element_text(angle = 25))+
  labs(title = 'Which product lines do Non-members spend the most',
       y='Total Sales',x='Product Lines')
normal

#Analysis by Customer type-Member
SS_df_analysed<-subset(SS_df_cleaned,SS_df_cleaned$Customer.type=="Member")
member<-ggplot(data = SS_df_analysed, aes(x=Product.line,y=Total, color=Product.line)) + 
  geom_boxplot()+
  scale_color_brewer(palette="Dark2") + 
  geom_jitter(shape=16, position=position_jitter(0.2))+
  theme(axis.text.x = element_text(angle = 25))+
  labs(title = 'Which product lines do Members spend the most',
       y='Total Sales',x='Product Lines')
member

  • The distribution of sales across categories for customers with membership is more equally spread out compare to those of normal customers.

  • Non-members maximum spending across categories are mostly less than or close to $500 only while customers with membership spend the most in health & beauty line.

  • There are big differences in the distributions of sales between non-members and members in fashion accessories, health & beauty and sports & travel product lines.


Data Analysis

Answer different types of business questions for analysis

Q1: Descriptive Problem

What is the top payment method based on the product lines?

Plotting the bar graph for the frequency of each payment methods.

# count Payment and change the count values into percentage

plotdata <- SS_df_cleaned %>% count(Payment) %>%
  mutate(pct = n / sum(n), # count / sum of count
         pctlabel = paste0(round(pct*100), "%")) #percentage form


# plot the bars for Payment in ascending order

ggplot(plotdata, 
       aes(x = reorder(Payment, pct), 
           y = pct)) + 
  geom_bar(stat = "identity",
           fill = "#6fd1d1", 
           color = "black") +
  geom_text(aes(label = pctlabel), 
            vjust=-0.25) +
  scale_y_continuous(labels = percent) +
  theme_minimal() +
  labs(x = "Payment", 
       y = "Frequency", 
       title  = "Payment Mehods")



Plotting the bar graph for the frequency of each payment methods according to the product lines.

# plot the bars for Payment based on the colours of Product.line

plotdata <- SS_df_cleaned %>% count(Payment, Product.line) %>%
  mutate(pct = n / sum(n),
         pctlabel = paste0(round(pct*100), "%"))


ggplot(plotdata, 
       aes(x = Product.line, 
           y = pct,
           fill = Payment)) + 
  geom_bar(stat = "identity",
           position=position_dodge()) +
  scale_fill_brewer(palette="Dark2") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  geom_text(aes(label = pctlabel), 
            vjust=0.5, 
            hjust = 1.5, 
            position = position_dodge(1.0), 
            size=3.5,
            colour = "white") +
  scale_y_continuous(labels = percent) +
  labs(x = " ", 
       y = "Frequency", 
       title  = "Payment methods according to the product lines") +
  coord_flip() #horizontal bar graph


Results:

  • 34% of the customers prefer to use e-wallet as their payment method, another 34% prefer cash, while 31% preferred credit card
  • Customers prefer to use cash when they bought electronic accessories items.
  • Customers prefer to use their credit cards more when they bought for food and beverages items.
  • Customers prefer to use e-wallet when they pay for home and lifestyle products or fashion accessories items.


Q2: Predictive Problem

When is the best time to offer discounts?

Plotting the bar graph for the frequency of the days.

# Days frequency

#count the days
Cdays <- SS_df_cleaned %>% count(Days) 

#plot
ggplot(Cdays, aes(x = reorder(Days, n), y = n)) + 
  geom_bar(stat = "identity",
           fill = "indianred3", 
           color = "black") +
  labs(x = "Days", 
       y = "Frequency", 
       title  = "Frequency of Days")+
  theme_minimal() 



Plotting the bar graph for the frequency of the time using binning method.

# Time frequency

#count the time
Ctime <- SS_df_cleaned %>% select(Time) 

#create tags for binning 
tags <- c("10 am - 12 pm", "12 pm - 3 pm", "3 pm - 5 pm", "5 pm - 7 pm",
          "7 pm - 9 pm") 

#binning the time column
time_bins <- as_tibble(Ctime) %>% 
  mutate(Time_bin = case_when(
    Time >= 10 & Time < 12 ~ tags[1], 
    Time >= 12 & Time < 15 ~ tags[2],
    Time >= 15 & Time < 17 ~ tags[3],
    Time >= 17 & Time < 19 ~ tags[4],
    Time >= 19 & Time < 21 ~ tags[5],
    ))

#count the percentage of the time bins 
plotdata <- time_bins %>% count(Time_bin) %>%
  mutate(pct = n / sum(n),
         pctlabel = paste0(round(pct*100), "%"))

#plot
ggplot(plotdata, 
       aes(x = reorder(Time_bin, -pct), 
           y = pct)) + 
  geom_bar(stat = "identity",
           fill = "#6fd1d1",
           colour = "black") +
  geom_text(aes(label = pctlabel), 
            vjust = -0.25) +
  scale_y_continuous(labels = percent) +
  labs(x = "Time Range", 
       y = "Frequency", 
       title  = "Busy Hour")+
  theme_minimal() 



Plotting the bar graph for the frequency of the days and the time.

# day n time

#add the time bins into the new df
new <- SS_df_cleaned %>% mutate(time_bins)


plotdata <- new %>% count(Days, Time_bin) %>%
  mutate(pct = n / sum(n),
         pctlabel = paste0(round(pct*100), "%"))


ggplot(plotdata, 
       aes(x = Time_bin, 
           y = pctlabel,
           fill = Days)) + 
  geom_bar(stat = "identity",
           position=position_dodge()) +
  scale_fill_brewer(palette="Set2") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Time Range", 
       y = "Frequency", 
       title  = "Busy Hour")


Results:

  • The number of customers during the Saturdays were the highest, while the number of customers on Mondays were the least.
  • It is the busiest during 12pm to 3pm and the least busy time is during 5pm to 7pm.
  • Mondays are the least busy day at the supermarkets.
  • Therefore the best time to offer discounts are:
    • On Mondays
    • 5pm - 7pm on any day
    • 5pm - 7pm on Mondays
    • 3pm - 5pm on Sundays, Tuesdays, Thursdays.


Q3: Associative Problem

Is there a relationship between total sales on weekdays vs weekend?

Plotting a heatmap to visualize relationship between sales on weekdays vs weekend.

library(superheat)

new$Days <- recode(new$Days,
         "Monday" = "Weekdays",
         "Tuesday" = "Weekdays",
         "Wednesday" = "Weekdays", 
         "Thursday" = "Weekdays",
         "Friday" = "Weekdays",
         "Saturday" = "Weekends",
         "Sunday" = "Weekends")



#sales relationship between weekdays and weekends


sales <- new %>% select(Total)

  #create tags for binning
tags <- c("$0-$200", "$200-$400", "$400-$600","$600-$800", ">$800") 

  #binning the sales column
Total_bins <- as_tibble(sales) %>% 
  mutate(Total_bins = case_when(
    Total >= 10 & Total < 200 ~ tags[1],
    Total >= 200 & Total < 400 ~ tags[2],
    Total >= 400 & Total < 600 ~ tags[3],
    Total >= 600 & Total < 800 ~ tags[4],
    Total >= 800  ~ tags[5]
  ))

  # temporary df
temp <- new %>% 
  mutate(Total_bins) %>% 
  count(Days, Total_bins) %>%
  filter(Days == "Weekends") %>%
  mutate(pct = n / sum(n),
         pctlabel = paste0(round(pct*100), "%"))

  #plot 
new %>% 
  mutate(Total_bins) %>% 
  count(Days, Total_bins) %>%
  filter(Days == "Weekdays") %>%
  mutate(pct = n / sum(n),
         pctlabel = paste0(round(pct*100), "%")) %>%
  add_row(temp) %>%
  ggplot(mapping = aes(x = Days, y = Total_bins)) +
  geom_tile(mapping = aes(fill = pct*100)) +
  geom_text(aes(label=pctlabel), colour = "#8c9191") +
  scale_fill_distiller(palette = "YlGnBu", direction = 1) +
  guides(fill = guide_legend(title = "Pecent (%)")) +
  theme_minimal() +
  labs(x = " ", 
       y = " ", 
       title  = "Customer's spendature during Weekdays and Weekends")


Results:

  • No matter if it is the weekdays or the weekends, majority of the customers will only spend around $0 to $200.
  • On the weekends however, more customers spend $200 to $400 and $400 to $600 compared to when on the weekdays.

Conclusion

  • Based on the results, we can conclude that:

    • Descriptive Problem: The data shows that there is a preferred payment method based on the product lines
    • Predictive Problem: The data shows that there is a trend on the number of customers at certain days and time, which provides information for business decisions such as targeted time to offerdiscounts.
    • Associative Problem: There is weak causal relationship observed between average sales on weekdays vs weekends.

  • EDA does not determine models on the data, but it helps business owners to explore possible data analysis models that best suit the data based on the business problems and goals.

  • In conclusion, we have successfully meet our objective by answering the questions posed in the beginning for different types of problems.