Group Project WQD7004 Programming for Data Science

Supermarket Sales

Introduction

Supermarket conventionally are defined as places where people go to buy their food, beverages and household products. Products or items in the supermarket are grouped into sections which allows people to conveniently and effectively search and purchase the required items. This grouping of section in the supermarket is determined using the supermarket to extract insights on the purchasing interest of the consumer. In most cases, historical data of supermarket sales are analysed to make informed decisions in the management of the supermarket.

The dataset to be used in this report is obtained from Kaggle at https://www.kaggle.com/aungpyaeap/supermarket-sales. To summarise the content of the dataset, this dataset consists of sales of a supermarket company in 3 different branches for 3 months. Based on the data, it is observed that there are 1000 rows and 17 columns, with each column defined as follows:

Variables	Description
Invoice id	Computer generated sales slip invoice identification number
Branch	Branch of supercenter (3 branches are available identified by A, B and C).
City	Location of supercenters
Customer type	Type of customers, recorded by Members for customers using member card and Normal for without member card.
Gender	Gender type of customer
Product line	General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
Unit price	Price of each product in $
Quantity	Number of products purchased by customer
Tax	5% tax fee for customer buying
Total	Total price including tax
Date	Date of purchase (Record available from January 2019 to March 2019)
Time	Purchase time (10am to 9pm)
Payment	Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
COGS	Cost of goods sold
Gross margin percentage	Gross margin percentage
Gross income	Gross income
Rating	Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)

Objectives

Our objective is to find valuable insights and pattern from the datasets. We have come out with several question that we wanted to figure out as per the following:

Are there any relationship between customer ratings based on customer type?
Which gender has purchase the most within the three cities?
What is the trend of total sales from January to March?
When is the specific day in a week where the highest purchase was made?
By which payment type had the sales been generated by the supermarket?
Is there any customer behavioural segmentation besides customer type to be found in the dataset?

Data Preprocessing

# install.packages('corrplot')
# install.packages("Hmisc")
# install.packages('rpart')
# install.packages('rpart.plot')
# install.packages('caret')
# install.packages('fastDummies')
# install.packages('fpc')
# install.packages('factoextra')

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(corrplot)

## corrplot 0.92 loaded

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(Hmisc)

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(purrr)

## 
## Attaching package: 'purrr'

## The following object is masked from 'package:caret':
## 
##     lift

library(rpart)
library(rpart.plot)
library(tidyr)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(fastDummies)
library(fpc)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

# Suppress summarise info
options(dplyr.summarise.inform = FALSE)

# load dataframe from github
supermarket_raw_df <- read.csv("https://raw.githubusercontent.com/johnjon98/wqd7004/master/Supermarket%20edited%20Raw%20data(uncleaned).csv", header=TRUE)
head(supermarket_raw_df)

##    Invoice.ID Branch      City Customer.type Gender           Product.line
## 1 750-67-8428      A    Yangon        Member Female      Health and beauty
## 2 226-31-3081      C Naypyitaw        Normal Female Electronic accessories
## 3 631-41-3108      A    Yangon        Normal   Male     Home and lifestyle
## 4 123-19-1176      A    Yangon        Member   Male      Health and beauty
## 5 373-73-7910      A    Yangon        Normal   Male      Sports and travel
## 6 699-14-3026      C Naypyitaw        Normal   Male Electronic accessories
##   Unit.price Quantity  Tax.5. Total..Sales       Date  Time     Payment   cogs
## 1      74.69        7 26.1415     548.9715 01/05/2019 13:08     Ewallet 522.83
## 2      15.28        5  3.8200      80.2200 03/08/2019 10:29        Cash  76.40
## 3      46.33        7 16.2155     340.5255 03/03/2019 13:23 Credit card 324.31
## 4      58.22        8 23.2880     489.0480  1/27/2019 20:33     Ewallet 465.76
## 5      86.31        7 30.2085     634.3785 02/08/2019 10:37     Ewallet 604.17
## 6      85.39        7 29.8865     627.6165  3/25/2019 18:30     Ewallet 597.73
##   gross.margin.percentage gross.income Rating
## 1                4.761905      26.1415    9.1
## 2                4.761905       3.8200    9.6
## 3                4.761905      16.2155    7.4
## 4                4.761905      23.2880    8.4
## 5                4.761905      30.2085    5.3
## 6                4.761905      29.8865    4.1

# check for null values
as.data.frame(colSums(is.na(supermarket_raw_df)))

##                         colSums(is.na(supermarket_raw_df))
## Invoice.ID                                               0
## Branch                                                   0
## City                                                     0
## Customer.type                                            0
## Gender                                                   0
## Product.line                                             0
## Unit.price                                               4
## Quantity                                                 4
## Tax.5.                                                   0
## Total..Sales                                             0
## Date                                                     0
## Time                                                     0
## Payment                                                  0
## cogs                                                     0
## gross.margin.percentage                                 11
## gross.income                                             0
## Rating                                                   9

# Remove unwanted_variables
supermarket_1_df <- supermarket_raw_df %>% select(-Invoice.ID, -gross.margin.percentage, -Time, -cogs, -gross.income)

# Feature Extraction - Date Format 
supermarket_2_df <- supermarket_1_df %>% 
  mutate(month=month(as.POSIXlt(Date, format="%m/%d/%Y")), day=day(as.POSIXlt(Date, format="%m/%d/%Y"))) #%>%
  #select(-Date)
supermarket_2_df <- supermarket_2_df[,-1]
supermarket_2_df$Date <- as.Date(supermarket_2_df$Date, "%m/%d/%y")
year(supermarket_2_df$Date) <- 2019

head(supermarket_2_df)

##        City Customer.type Gender           Product.line Unit.price Quantity
## 1    Yangon        Member Female      Health and beauty      74.69        7
## 2 Naypyitaw        Normal Female Electronic accessories      15.28        5
## 3    Yangon        Normal   Male     Home and lifestyle      46.33        7
## 4    Yangon        Member   Male      Health and beauty      58.22        8
## 5    Yangon        Normal   Male      Sports and travel      86.31        7
## 6 Naypyitaw        Normal   Male Electronic accessories      85.39        7
##    Tax.5. Total..Sales       Date     Payment Rating month day
## 1 26.1415     548.9715 2019-01-05     Ewallet    9.1     1   5
## 2  3.8200      80.2200 2019-03-08        Cash    9.6     3   8
## 3 16.2155     340.5255 2019-03-03 Credit card    7.4     3   3
## 4 23.2880     489.0480 2019-01-27     Ewallet    8.4     1  27
## 5 30.2085     634.3785 2019-02-08     Ewallet    5.3     2   8
## 6 29.8865     627.6165 2019-03-25     Ewallet    4.1     3  25

# Imputation for missing values (mean or other) 
supermarket_3_df <- supermarket_2_df
supermarket_3_df$Rating <- with(supermarket_3_df, impute(Rating, mean))

# Imputation - Calculate quantity and unit price
supermarket_4_df <- supermarket_3_df[rowSums(is.na(supermarket_3_df)) > 0,]

supermarket_4_df <- supermarket_3_df %>%
  mutate(impute_quantity = ifelse(!is.na(Unit.price), ((Total..Sales - Tax.5.)/Unit.price), Quantity),
    impute_unitprice = ifelse(!is.na(Quantity), ((Total..Sales - Tax.5.)/Quantity), Unit.price) 
  ) %>%
  select(-Unit.price, -Quantity)

# Add column weekday
supermarket_clean_df <- mutate(supermarket_4_df, Weekday = wday(Date))

supermarket_clean_df$Weekday <- factor(supermarket_clean_df$Weekday, levels=c(1,2,3,4,5,6,7), 
                         labels=c("Sunday","Monday","Tuesday","Wednesday","Thursday",
                                  "Friday","Saturday"))
head(supermarket_clean_df)

##        City Customer.type Gender           Product.line  Tax.5. Total..Sales
## 1    Yangon        Member Female      Health and beauty 26.1415     548.9715
## 2 Naypyitaw        Normal Female Electronic accessories  3.8200      80.2200
## 3    Yangon        Normal   Male     Home and lifestyle 16.2155     340.5255
## 4    Yangon        Member   Male      Health and beauty 23.2880     489.0480
## 5    Yangon        Normal   Male      Sports and travel 30.2085     634.3785
## 6 Naypyitaw        Normal   Male Electronic accessories 29.8865     627.6165
##         Date     Payment Rating month day impute_quantity impute_unitprice
## 1 2019-01-05     Ewallet    9.1     1   5               7            74.69
## 2 2019-03-08        Cash    9.6     3   8               5            15.28
## 3 2019-03-03 Credit card    7.4     3   3               7            46.33
## 4 2019-01-27     Ewallet    8.4     1  27               8            58.22
## 5 2019-02-08     Ewallet    5.3     2   8               7            86.31
## 6 2019-03-25     Ewallet    4.1     3  25               7            85.39
##    Weekday
## 1 Saturday
## 2   Friday
## 3   Sunday
## 4   Sunday
## 5   Friday
## 6   Monday

as.data.frame(colSums(is.na(supermarket_clean_df)))

##                  colSums(is.na(supermarket_clean_df))
## City                                                0
## Customer.type                                       0
## Gender                                              0
## Product.line                                        0
## Tax.5.                                              0
## Total..Sales                                        0
## Date                                                0
## Payment                                             0
## Rating                                              0
## month                                               0
## day                                                 0
## impute_quantity                                     0
## impute_unitprice                                    0
## Weekday                                             0

Exploratory Data Analysis

dim(supermarket_clean_df)

## [1] 1000   14

str(supermarket_clean_df)

## 'data.frame':    1000 obs. of  14 variables:
##  $ City            : chr  "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
##  $ Customer.type   : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender          : chr  "Female" "Female" "Male" "Male" ...
##  $ Product.line    : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Tax.5.          : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Total..Sales    : num  549 80.2 340.5 489 634.4 ...
##  $ Date            : Date, format: "2019-01-05" "2019-03-08" ...
##  $ Payment         : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ Rating          : 'impute' num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##   ..- attr(*, "imputed")= int [1:9] 20 63 67 130 167 170 845 891 953
##  $ month           : num  1 3 3 1 2 3 2 2 1 2 ...
##  $ day             : int  5 8 3 27 8 25 25 24 10 20 ...
##  $ impute_quantity : num  7 5 7 8 7 7 6 10 2 3 ...
##  $ impute_unitprice: num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Weekday         : Factor w/ 7 levels "Sunday","Monday",..: 7 6 1 1 6 2 2 1 5 4 ...

summary(supermarket_clean_df)

## 
##  9 values imputed to 6.972048

##      City           Customer.type         Gender          Product.line      
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      Tax.5.         Total..Sales          Date              Payment         
##  Min.   : 0.5085   Min.   :  10.68   Min.   :2019-01-01   Length:1000       
##  1st Qu.: 5.9249   1st Qu.: 124.42   1st Qu.:2019-01-24   Class :character  
##  Median :12.0880   Median : 253.85   Median :2019-02-13   Mode  :character  
##  Mean   :15.3794   Mean   : 322.97   Mean   :2019-02-14                     
##  3rd Qu.:22.4453   3rd Qu.: 471.35   3rd Qu.:2019-03-08                     
##  Max.   :49.6500   Max.   :1042.65   Max.   :2019-03-30                     
##                                                                             
##      Rating           month            day        impute_quantity
##  Min.   : 4.000   Min.   :1.000   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 5.500   1st Qu.:1.000   1st Qu.: 8.00   1st Qu.: 3.00  
##  Median : 6.972   Median :2.000   Median :15.00   Median : 5.00  
##  Mean   : 6.972   Mean   :1.993   Mean   :15.26   Mean   : 5.51  
##  3rd Qu.: 8.400   3rd Qu.:3.000   3rd Qu.:23.00   3rd Qu.: 8.00  
##  Max.   :10.000   Max.   :3.000   Max.   :31.00   Max.   :10.00  
##                                                                  
##  impute_unitprice      Weekday   
##  Min.   :10.08    Sunday   :133  
##  1st Qu.:32.88    Monday   :125  
##  Median :55.23    Tuesday  :158  
##  Mean   :55.67    Wednesday:143  
##  3rd Qu.:77.94    Thursday :138  
##  Max.   :99.96    Friday   :139  
##                   Saturday :164

# plot histogram of numeric variables
 supermarket_clean_df%>% 
  select(-month, -day) %>%
  purrr::keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

## Warning: attributes are not identical across measure variables;
## they will be dropped

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Correlation - to find relationship between numerical variables
numeric <- map_lgl(supermarket_clean_df,is.numeric)
correlations <- cor(supermarket_clean_df[,numeric])
diag(correlations) <- 0
corrplot(correlations,method = 'color', tl.col = 'black',
         col = COL1('YlGn'), addgrid.col = 'white', addCoef.col = 'white', type='upper')

Results and Discussion

Question 1 - Are there any relationship between customer ratings based on customer type?

Based on the chart below, it was observed that at lower ratings, normal customer are more prone to giving low ratings. On the other hand, members are prone to giving higher ratings. This could be due to members receiving better customer service and discounts. Therefore, increasing their overall satisfaction.

supermarket_df <- supermarket_clean_df
supermarket_df %>% 
  select(Customer.type, Rating) %>%
  group_by(Customer.type, Rating) %>%
  summarise(count_rating=n()) %>% 
  ggplot(aes(x=Rating, y=count_rating, color=Customer.type)) + 
  # geom_point(stat="identity", position="identity") +
  geom_smooth(method="loess", se=FALSE) +
  ggtitle("Line Chart of number of rating against Rating")

## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

## `geom_smooth()` using formula 'y ~ x'

Question 2 - Which gender has purchase the most within the three cities?

In general , there is no big difference between the purchases made by males and females. However,it can be seen that more purchases made by females than males in all categories except Health and Beauty.

supermarket_df %>% 
  filter(Product.line != "") %>%
  group_by(Gender,Product.line) %>% 
  summarise(Sum=sum(impute_quantity))%>% 
  ggplot(aes(x=reorder(Product.line,Sum),Sum, fill = Gender, label=Sum))+
  geom_bar(stat="identity",position="dodge")+
  #geom_text(size = 5,vjust=1,hjust = 1.2, position =position_dodge(.9))+
  xlab("Product Line")+
  coord_flip()+
  ggtitle("Purchases made by Gender")

Question 3 - What is the trend of total sales from January to March?

From this chart, we can see that there is no significant trend for the sales from January to March in Yangon, Mandalay, and Naypytitaw. The highest daily sales for Yangon and Naypyitaw have been achieved in January and March for Mandalay. Whereas the lowest daily sales for Yangon and Mandalay was reached in February and March for Naypyitaw.

Yangon <- supermarket_df %>% filter(City == "Yangon")
total_Yangon <- data.frame(xtabs(formula = Total..Sales~Date, data = Yangon))
total_Yangon$Date <- as.Date(total_Yangon$Date)
Mandalay <- supermarket_df %>% filter(City == "Mandalay")
total_Mandalay <- data.frame(xtabs(formula = Total..Sales~Date, data = Mandalay))
total_Mandalay$Date <- as.Date(total_Mandalay$Date)
Naypyitaw <- supermarket_df %>% filter(City == "Naypyitaw")
total_Naypyitaw <- data.frame(xtabs(formula = Total..Sales~Date, data = Naypyitaw))
total_Naypyitaw$Date <- as.Date(total_Naypyitaw$Date)

plot1 <- ggplot(data = total_Yangon, mapping = aes(x = Date, y = Freq))+ geom_line(color="blue",alpha=0.4)+
  theme_linedraw()+ ggtitle("Total Sales per day in Yangon")+ xlab("Date")+ ylab("Total Sales Per Day")
plot2 <- ggplot(data = total_Mandalay, mapping = aes(x = Date, y = Freq))+ geom_line(color="red",alpha=0.5)+
  theme_linedraw()+ ggtitle("Total Sales per day in Mandalay")+ xlab("Date")+ ylab("Total Sales Per Day")
plot3 <- ggplot(data = total_Naypyitaw, mapping = aes(x = Date, y = Freq))+ geom_line(color="green",alpha=0.7)+
  theme_linedraw()+ ggtitle("Total Sales per day in Naypyitaw")+ xlab("Date")+ ylab("Total Sales Per Day")

grid.arrange(plot1, plot2, plot3)

Question 4 - When is the specific day in a week where the highest purchase was made?

This chart showed that the highest sales is on Saturday as most of the people would prefer to buy groceries on weekend. On the contrary, Mondays is the lowest in sales, probably because it is start of the working week. Interestingly, we can see that Tuesday is the second highest sales in a week.

ggplot(supermarket_clean_df)+ 
  geom_bar(mapping = aes(x=Weekday, color=City), fill='white', 
           position='dodge')+
  ggtitle("Total Sales Per Week")+ 
  xlab("Weekday")+ 
  ylab("Total Sales Per Week")

Question 5 - Is there any combination of factors from customers that lead to their payment type while purchasing at the supermarket?

Based on the decision tree generated, there are several different combination of factors that lead to customer choose whether to make Cash payment type, E-wallet payment type or Credit Card payment type. The 3 combination of factors that can lead to each different payment type are:

IF customer purchase item with unit price less than 51 & less than 30, AND not choose Product Line Fashion Accessories, AND with total sales more than equals to 116, THEN customer preferably choose CREDIT CARD payment type.
IF customer purchase item with unit price more than 51, AND within the first 3 months, AND purchase unite price less than 53, THEN customer preferably choose CASH payment type.
IF customer purchase item with unit price less than 51 & less than 30, AND not choose Product Line Fashion Accessories, AND purchase on Saturday, THEN customer preferably choose E-wallet payment type.

#remove unwanted_variables
supermarket_dtree <- supermarket_df %>% select(-Tax.5., -Date, -day)

set.seed(3033)
intrain <- createDataPartition(y = supermarket_dtree$Payment, p= 0.7, list = FALSE)
training <- supermarket_dtree[intrain,]
testing <- supermarket_dtree[-intrain,]

#check dimensions of train & test set
dim(training)

## [1] 703  11

dim(testing)

## [1] 297  11

set.seed(3333)
dtree_fit <- train(Payment ~., data = training, method = "rpart",
                   parms = list(split = "information"),
                   tuneLength = 10)
dtree_fit

## CART 
## 
## 703 samples
##  10 predictor
##   4 classes: '', 'Cash', 'Credit card', 'Ewallet' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 703, 703, 703, 703, 703, 703, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa        
##   0.003607504  0.3340878   5.806463e-03
##   0.004329004  0.3334560   5.360985e-03
##   0.005050505  0.3310504   1.889207e-03
##   0.005952381  0.3297721   3.194189e-04
##   0.006493506  0.3300802   5.603270e-04
##   0.010822511  0.3270452  -4.862290e-03
##   0.019480519  0.3292546   1.676463e-05
##   0.021645022  0.3246012  -7.343843e-03
##   0.025974026  0.3273440  -5.080364e-03
##   0.036796537  0.3228411  -6.752321e-03
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003607504.

prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)

Question 6 - Is there any customer behavioural segmentation besides customer type to be found in the dataset?

df <- supermarket_df
results <- fastDummies::dummy_cols(df, remove_first_dummy = TRUE)


results[,1:4]    <- NULL
results$Payment  <- NULL
results$Date     <- NULL
results$Weekday     <- NULL

resultsca <- scale(results)

set.seed(20)
k.means.fit <- kmeans(resultsca,2,25)

# plotcluster(resultsca, k.means.fit$cluster)
# fviz_cluster(k.means.fit, data = resultsca)

k2 <- kmeans(resultsca, centers = 2, nstart = 25)
k3 <- kmeans(resultsca, centers = 3, nstart = 25)
k4 <- kmeans(resultsca, centers = 4, nstart = 25)
k5 <- kmeans(resultsca, centers = 5, nstart = 25)
p1 <- fviz_cluster(k2, geom = "point", data = resultsca) + ggtitle('k = 2')
p2 <- fviz_cluster(k3, geom = "point", data = resultsca) + ggtitle('k = 3')
p3 <- fviz_cluster(k4, geom = "point", data = resultsca) + ggtitle('k = 4')
p4 <- fviz_cluster(k5, geom = "point", data = resultsca) + ggtitle('k = 5')

grid.arrange(p1, p2, p3, p4, nrow=2)

df$cluster <- k.means.fit$cluster

Conclusion

Normal customers give lower ratings and members give higher ratings.

More purchases made by females than males across all product lines in general.

No significant trend was observed for sales from January to March.

Saturday is the day with highest sales, Monday is the day with lowest sales.

Various factors contribute to customers choosing to pay with either credit card, cash, or E-wallet

Customer segmentation can be seen according to the clustering.