Supermarket conventionally are defined as places where people go to buy their food, beverages and household products. Products or items in the supermarket are grouped into sections which allows people to conveniently and effectively search and purchase the required items. This grouping of section in the supermarket is determined using the supermarket to extract insights on the purchasing interest of the consumer. In most cases, historical data of supermarket sales are analysed to make informed decisions in the management of the supermarket.
The dataset to be used in this report is obtained from Kaggle at https://www.kaggle.com/aungpyaeap/supermarket-sales. To summarise the content of the dataset, this dataset consists of sales of a supermarket company in 3 different branches for 3 months. Based on the data, it is observed that there are 1000 rows and 17 columns, with each column defined as follows:
| Variables | Description |
|---|---|
| Invoice id | Computer generated sales slip invoice identification number |
| Branch | Branch of supercenter (3 branches are available identified by A, B and C). |
| City | Location of supercenters |
| Customer type | Type of customers, recorded by Members for customers using member card and Normal for without member card. |
| Gender | Gender type of customer |
| Product line | General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel |
| Unit price | Price of each product in $ |
| Quantity | Number of products purchased by customer |
| Tax | 5% tax fee for customer buying |
| Total | Total price including tax |
| Date | Date of purchase (Record available from January 2019 to March 2019) |
| Time | Purchase time (10am to 9pm) |
| Payment | Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet) |
| COGS | Cost of goods sold |
| Gross margin percentage | Gross margin percentage |
| Gross income | Gross income |
| Rating | Customer stratification rating on their overall shopping experience (On a scale of 1 to 10) |
Our objective is to find valuable insights and pattern from the datasets. We have come out with several question that we wanted to figure out as per the following:
# install.packages('corrplot')
# install.packages("Hmisc")
# install.packages('rpart')
# install.packages('rpart.plot')
# install.packages('caret')
# install.packages('fastDummies')
# install.packages('fpc')
# install.packages('factoextra')
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:caret':
##
## lift
library(rpart)
library(rpart.plot)
library(tidyr)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(fastDummies)
library(fpc)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Suppress summarise info
options(dplyr.summarise.inform = FALSE)
# load dataframe from github
supermarket_raw_df <- read.csv("https://raw.githubusercontent.com/johnjon98/wqd7004/master/Supermarket%20edited%20Raw%20data(uncleaned).csv", header=TRUE)
head(supermarket_raw_df)
## Invoice.ID Branch City Customer.type Gender Product.line
## 1 750-67-8428 A Yangon Member Female Health and beauty
## 2 226-31-3081 C Naypyitaw Normal Female Electronic accessories
## 3 631-41-3108 A Yangon Normal Male Home and lifestyle
## 4 123-19-1176 A Yangon Member Male Health and beauty
## 5 373-73-7910 A Yangon Normal Male Sports and travel
## 6 699-14-3026 C Naypyitaw Normal Male Electronic accessories
## Unit.price Quantity Tax.5. Total..Sales Date Time Payment cogs
## 1 74.69 7 26.1415 548.9715 01/05/2019 13:08 Ewallet 522.83
## 2 15.28 5 3.8200 80.2200 03/08/2019 10:29 Cash 76.40
## 3 46.33 7 16.2155 340.5255 03/03/2019 13:23 Credit card 324.31
## 4 58.22 8 23.2880 489.0480 1/27/2019 20:33 Ewallet 465.76
## 5 86.31 7 30.2085 634.3785 02/08/2019 10:37 Ewallet 604.17
## 6 85.39 7 29.8865 627.6165 3/25/2019 18:30 Ewallet 597.73
## gross.margin.percentage gross.income Rating
## 1 4.761905 26.1415 9.1
## 2 4.761905 3.8200 9.6
## 3 4.761905 16.2155 7.4
## 4 4.761905 23.2880 8.4
## 5 4.761905 30.2085 5.3
## 6 4.761905 29.8865 4.1
# check for null values
as.data.frame(colSums(is.na(supermarket_raw_df)))
## colSums(is.na(supermarket_raw_df))
## Invoice.ID 0
## Branch 0
## City 0
## Customer.type 0
## Gender 0
## Product.line 0
## Unit.price 4
## Quantity 4
## Tax.5. 0
## Total..Sales 0
## Date 0
## Time 0
## Payment 0
## cogs 0
## gross.margin.percentage 11
## gross.income 0
## Rating 9
# Remove unwanted_variables
supermarket_1_df <- supermarket_raw_df %>% select(-Invoice.ID, -gross.margin.percentage, -Time, -cogs, -gross.income)
# Feature Extraction - Date Format
supermarket_2_df <- supermarket_1_df %>%
mutate(month=month(as.POSIXlt(Date, format="%m/%d/%Y")), day=day(as.POSIXlt(Date, format="%m/%d/%Y"))) #%>%
#select(-Date)
supermarket_2_df <- supermarket_2_df[,-1]
supermarket_2_df$Date <- as.Date(supermarket_2_df$Date, "%m/%d/%y")
year(supermarket_2_df$Date) <- 2019
head(supermarket_2_df)
## City Customer.type Gender Product.line Unit.price Quantity
## 1 Yangon Member Female Health and beauty 74.69 7
## 2 Naypyitaw Normal Female Electronic accessories 15.28 5
## 3 Yangon Normal Male Home and lifestyle 46.33 7
## 4 Yangon Member Male Health and beauty 58.22 8
## 5 Yangon Normal Male Sports and travel 86.31 7
## 6 Naypyitaw Normal Male Electronic accessories 85.39 7
## Tax.5. Total..Sales Date Payment Rating month day
## 1 26.1415 548.9715 2019-01-05 Ewallet 9.1 1 5
## 2 3.8200 80.2200 2019-03-08 Cash 9.6 3 8
## 3 16.2155 340.5255 2019-03-03 Credit card 7.4 3 3
## 4 23.2880 489.0480 2019-01-27 Ewallet 8.4 1 27
## 5 30.2085 634.3785 2019-02-08 Ewallet 5.3 2 8
## 6 29.8865 627.6165 2019-03-25 Ewallet 4.1 3 25
# Imputation for missing values (mean or other)
supermarket_3_df <- supermarket_2_df
supermarket_3_df$Rating <- with(supermarket_3_df, impute(Rating, mean))
# Imputation - Calculate quantity and unit price
supermarket_4_df <- supermarket_3_df[rowSums(is.na(supermarket_3_df)) > 0,]
supermarket_4_df <- supermarket_3_df %>%
mutate(impute_quantity = ifelse(!is.na(Unit.price), ((Total..Sales - Tax.5.)/Unit.price), Quantity),
impute_unitprice = ifelse(!is.na(Quantity), ((Total..Sales - Tax.5.)/Quantity), Unit.price)
) %>%
select(-Unit.price, -Quantity)
# Add column weekday
supermarket_clean_df <- mutate(supermarket_4_df, Weekday = wday(Date))
supermarket_clean_df$Weekday <- factor(supermarket_clean_df$Weekday, levels=c(1,2,3,4,5,6,7),
labels=c("Sunday","Monday","Tuesday","Wednesday","Thursday",
"Friday","Saturday"))
head(supermarket_clean_df)
## City Customer.type Gender Product.line Tax.5. Total..Sales
## 1 Yangon Member Female Health and beauty 26.1415 548.9715
## 2 Naypyitaw Normal Female Electronic accessories 3.8200 80.2200
## 3 Yangon Normal Male Home and lifestyle 16.2155 340.5255
## 4 Yangon Member Male Health and beauty 23.2880 489.0480
## 5 Yangon Normal Male Sports and travel 30.2085 634.3785
## 6 Naypyitaw Normal Male Electronic accessories 29.8865 627.6165
## Date Payment Rating month day impute_quantity impute_unitprice
## 1 2019-01-05 Ewallet 9.1 1 5 7 74.69
## 2 2019-03-08 Cash 9.6 3 8 5 15.28
## 3 2019-03-03 Credit card 7.4 3 3 7 46.33
## 4 2019-01-27 Ewallet 8.4 1 27 8 58.22
## 5 2019-02-08 Ewallet 5.3 2 8 7 86.31
## 6 2019-03-25 Ewallet 4.1 3 25 7 85.39
## Weekday
## 1 Saturday
## 2 Friday
## 3 Sunday
## 4 Sunday
## 5 Friday
## 6 Monday
as.data.frame(colSums(is.na(supermarket_clean_df)))
## colSums(is.na(supermarket_clean_df))
## City 0
## Customer.type 0
## Gender 0
## Product.line 0
## Tax.5. 0
## Total..Sales 0
## Date 0
## Payment 0
## Rating 0
## month 0
## day 0
## impute_quantity 0
## impute_unitprice 0
## Weekday 0
dim(supermarket_clean_df)
## [1] 1000 14
str(supermarket_clean_df)
## 'data.frame': 1000 obs. of 14 variables:
## $ City : chr "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
## $ Customer.type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product.line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Tax.5. : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Total..Sales : num 549 80.2 340.5 489 634.4 ...
## $ Date : Date, format: "2019-01-05" "2019-03-08" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ Rating : 'impute' num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## ..- attr(*, "imputed")= int [1:9] 20 63 67 130 167 170 845 891 953
## $ month : num 1 3 3 1 2 3 2 2 1 2 ...
## $ day : int 5 8 3 27 8 25 25 24 10 20 ...
## $ impute_quantity : num 7 5 7 8 7 7 6 10 2 3 ...
## $ impute_unitprice: num 74.7 15.3 46.3 58.2 86.3 ...
## $ Weekday : Factor w/ 7 levels "Sunday","Monday",..: 7 6 1 1 6 2 2 1 5 4 ...
summary(supermarket_clean_df)
##
## 9 values imputed to 6.972048
## City Customer.type Gender Product.line
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Tax.5. Total..Sales Date Payment
## Min. : 0.5085 Min. : 10.68 Min. :2019-01-01 Length:1000
## 1st Qu.: 5.9249 1st Qu.: 124.42 1st Qu.:2019-01-24 Class :character
## Median :12.0880 Median : 253.85 Median :2019-02-13 Mode :character
## Mean :15.3794 Mean : 322.97 Mean :2019-02-14
## 3rd Qu.:22.4453 3rd Qu.: 471.35 3rd Qu.:2019-03-08
## Max. :49.6500 Max. :1042.65 Max. :2019-03-30
##
## Rating month day impute_quantity
## Min. : 4.000 Min. :1.000 Min. : 1.00 Min. : 1.00
## 1st Qu.: 5.500 1st Qu.:1.000 1st Qu.: 8.00 1st Qu.: 3.00
## Median : 6.972 Median :2.000 Median :15.00 Median : 5.00
## Mean : 6.972 Mean :1.993 Mean :15.26 Mean : 5.51
## 3rd Qu.: 8.400 3rd Qu.:3.000 3rd Qu.:23.00 3rd Qu.: 8.00
## Max. :10.000 Max. :3.000 Max. :31.00 Max. :10.00
##
## impute_unitprice Weekday
## Min. :10.08 Sunday :133
## 1st Qu.:32.88 Monday :125
## Median :55.23 Tuesday :158
## Mean :55.67 Wednesday:143
## 3rd Qu.:77.94 Thursday :138
## Max. :99.96 Friday :139
## Saturday :164
# plot histogram of numeric variables
supermarket_clean_df%>%
select(-month, -day) %>%
purrr::keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Correlation - to find relationship between numerical variables
numeric <- map_lgl(supermarket_clean_df,is.numeric)
correlations <- cor(supermarket_clean_df[,numeric])
diag(correlations) <- 0
corrplot(correlations,method = 'color', tl.col = 'black',
col = COL1('YlGn'), addgrid.col = 'white', addCoef.col = 'white', type='upper')
Based on the chart below, it was observed that at lower ratings, normal customer are more prone to giving low ratings. On the other hand, members are prone to giving higher ratings. This could be due to members receiving better customer service and discounts. Therefore, increasing their overall satisfaction.
supermarket_df <- supermarket_clean_df
supermarket_df %>%
select(Customer.type, Rating) %>%
group_by(Customer.type, Rating) %>%
summarise(count_rating=n()) %>%
ggplot(aes(x=Rating, y=count_rating, color=Customer.type)) +
# geom_point(stat="identity", position="identity") +
geom_smooth(method="loess", se=FALSE) +
ggtitle("Line Chart of number of rating against Rating")
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.
## `geom_smooth()` using formula 'y ~ x'
In general , there is no big difference between the purchases made by males and females. However,it can be seen that more purchases made by females than males in all categories except Health and Beauty.
supermarket_df %>%
filter(Product.line != "") %>%
group_by(Gender,Product.line) %>%
summarise(Sum=sum(impute_quantity))%>%
ggplot(aes(x=reorder(Product.line,Sum),Sum, fill = Gender, label=Sum))+
geom_bar(stat="identity",position="dodge")+
#geom_text(size = 5,vjust=1,hjust = 1.2, position =position_dodge(.9))+
xlab("Product Line")+
coord_flip()+
ggtitle("Purchases made by Gender")
From this chart, we can see that there is no significant trend for the sales from January to March in Yangon, Mandalay, and Naypytitaw. The highest daily sales for Yangon and Naypyitaw have been achieved in January and March for Mandalay. Whereas the lowest daily sales for Yangon and Mandalay was reached in February and March for Naypyitaw.
Yangon <- supermarket_df %>% filter(City == "Yangon")
total_Yangon <- data.frame(xtabs(formula = Total..Sales~Date, data = Yangon))
total_Yangon$Date <- as.Date(total_Yangon$Date)
Mandalay <- supermarket_df %>% filter(City == "Mandalay")
total_Mandalay <- data.frame(xtabs(formula = Total..Sales~Date, data = Mandalay))
total_Mandalay$Date <- as.Date(total_Mandalay$Date)
Naypyitaw <- supermarket_df %>% filter(City == "Naypyitaw")
total_Naypyitaw <- data.frame(xtabs(formula = Total..Sales~Date, data = Naypyitaw))
total_Naypyitaw$Date <- as.Date(total_Naypyitaw$Date)
plot1 <- ggplot(data = total_Yangon, mapping = aes(x = Date, y = Freq))+ geom_line(color="blue",alpha=0.4)+
theme_linedraw()+ ggtitle("Total Sales per day in Yangon")+ xlab("Date")+ ylab("Total Sales Per Day")
plot2 <- ggplot(data = total_Mandalay, mapping = aes(x = Date, y = Freq))+ geom_line(color="red",alpha=0.5)+
theme_linedraw()+ ggtitle("Total Sales per day in Mandalay")+ xlab("Date")+ ylab("Total Sales Per Day")
plot3 <- ggplot(data = total_Naypyitaw, mapping = aes(x = Date, y = Freq))+ geom_line(color="green",alpha=0.7)+
theme_linedraw()+ ggtitle("Total Sales per day in Naypyitaw")+ xlab("Date")+ ylab("Total Sales Per Day")
grid.arrange(plot1, plot2, plot3)
This chart showed that the highest sales is on Saturday as most of the people would prefer to buy groceries on weekend. On the contrary, Mondays is the lowest in sales, probably because it is start of the working week. Interestingly, we can see that Tuesday is the second highest sales in a week.
ggplot(supermarket_clean_df)+
geom_bar(mapping = aes(x=Weekday, color=City), fill='white',
position='dodge')+
ggtitle("Total Sales Per Week")+
xlab("Weekday")+
ylab("Total Sales Per Week")
Based on the decision tree generated, there are several different combination of factors that lead to customer choose whether to make Cash payment type, E-wallet payment type or Credit Card payment type. The 3 combination of factors that can lead to each different payment type are:
IF customer purchase item with unit price less than 51 & less than 30, AND not choose Product Line Fashion Accessories, AND with total sales more than equals to 116, THEN customer preferably choose CREDIT CARD payment type.
IF customer purchase item with unit price more than 51, AND within the first 3 months, AND purchase unite price less than 53, THEN customer preferably choose CASH payment type.
IF customer purchase item with unit price less than 51 & less than 30, AND not choose Product Line Fashion Accessories, AND purchase on Saturday, THEN customer preferably choose E-wallet payment type.
#remove unwanted_variables
supermarket_dtree <- supermarket_df %>% select(-Tax.5., -Date, -day)
set.seed(3033)
intrain <- createDataPartition(y = supermarket_dtree$Payment, p= 0.7, list = FALSE)
training <- supermarket_dtree[intrain,]
testing <- supermarket_dtree[-intrain,]
#check dimensions of train & test set
dim(training)
## [1] 703 11
dim(testing)
## [1] 297 11
set.seed(3333)
dtree_fit <- train(Payment ~., data = training, method = "rpart",
parms = list(split = "information"),
tuneLength = 10)
dtree_fit
## CART
##
## 703 samples
## 10 predictor
## 4 classes: '', 'Cash', 'Credit card', 'Ewallet'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 703, 703, 703, 703, 703, 703, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.003607504 0.3340878 5.806463e-03
## 0.004329004 0.3334560 5.360985e-03
## 0.005050505 0.3310504 1.889207e-03
## 0.005952381 0.3297721 3.194189e-04
## 0.006493506 0.3300802 5.603270e-04
## 0.010822511 0.3270452 -4.862290e-03
## 0.019480519 0.3292546 1.676463e-05
## 0.021645022 0.3246012 -7.343843e-03
## 0.025974026 0.3273440 -5.080364e-03
## 0.036796537 0.3228411 -6.752321e-03
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003607504.
prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)
df <- supermarket_df
results <- fastDummies::dummy_cols(df, remove_first_dummy = TRUE)
results[,1:4] <- NULL
results$Payment <- NULL
results$Date <- NULL
results$Weekday <- NULL
resultsca <- scale(results)
set.seed(20)
k.means.fit <- kmeans(resultsca,2,25)
# plotcluster(resultsca, k.means.fit$cluster)
# fviz_cluster(k.means.fit, data = resultsca)
k2 <- kmeans(resultsca, centers = 2, nstart = 25)
k3 <- kmeans(resultsca, centers = 3, nstart = 25)
k4 <- kmeans(resultsca, centers = 4, nstart = 25)
k5 <- kmeans(resultsca, centers = 5, nstart = 25)
p1 <- fviz_cluster(k2, geom = "point", data = resultsca) + ggtitle('k = 2')
p2 <- fviz_cluster(k3, geom = "point", data = resultsca) + ggtitle('k = 3')
p3 <- fviz_cluster(k4, geom = "point", data = resultsca) + ggtitle('k = 4')
p4 <- fviz_cluster(k5, geom = "point", data = resultsca) + ggtitle('k = 5')
grid.arrange(p1, p2, p3, p4, nrow=2)
df$cluster <- k.means.fit$cluster
Normal customers give lower ratings and members give higher ratings.
More purchases made by females than males across all product lines in general.
No significant trend was observed for sales from January to March.
Saturday is the day with highest sales, Monday is the day with lowest sales.
Various factors contribute to customers choosing to pay with either credit card, cash, or E-wallet
Customer segmentation can be seen according to the clustering.