This dataset was sourced from Aung Pyae on Kaggle, who obtained it from a supermarket chain in Myanmar. This data is filled with sales components in 3 supemarket branches and includes 17 variables.
Invoice id: Computer generated sales slip invoice identification number
Branch: Branch of supercenter (3 branches are available identified by A, B and C).
City: Location of supercenters
Customer type: Type of customers, recorded by Members for customers using member card and Normal for without member card.
Gender: Gender type of customer
Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
Unit price: Price of each product in $
Quantity: Number of products purchased by customer
Tax: 5% tax fee for customer buying
Total: Total price including tax
Date: Date of purchase (Record available from January 2019 to March 2019)
Time: Purchase time (10am to 9pm)
Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
COGS: Cost of goods sold
Gross margin percentage: Gross margin percentage
Gross income: Gross income
Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)
## [1] 1000 17
## [1] "Invoice.ID" "Branch"
## [3] "City" "Customer.type"
## [5] "Gender" "Product.line"
## [7] "Unit.price" "Quantity"
## [9] "Tax.5." "Total"
## [11] "Date" "Time"
## [13] "Payment" "cogs"
## [15] "gross.margin.percentage" "gross.income"
## [17] "Rating"
From our inspection we can conclude : * supermarket data contain 1000 of rows and 17 of coloumns * Each of column name : “Invoice.ID”, “Branch”, “City”, “Customer.type”, “Gender”, “Product.line”, “Unit.price”,“Quantity”, “Tax.5”, “Total”, “Date”, “Time” , “Payment”, “cogs” , “gross.margin.percentage”, “gross.income”, “Rating”.
## 'data.frame': 1000 obs. of 17 variables:
## $ Invoice.ID : Factor w/ 1000 levels "101-17-6199",..: 815 143 654 19 340 734 316 265 703 727 ...
## $ Branch : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
## $ City : Factor w/ 3 levels "Mandalay","Naypyitaw",..: 3 2 3 3 3 2 3 2 3 1 ...
## $ Customer.type : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
## $ Product.line : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax.5. : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
## $ Date : Factor w/ 89 levels "1/1/2019","1/10/2019",..: 27 88 82 20 58 77 49 48 2 44 ...
## $ Time : Factor w/ 506 levels "10:00","10:01",..: 147 24 156 486 30 394 215 78 342 160 ...
## $ Payment : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
The feature invoice id is of no use . So let us drop the feature. We also need to check if there are any missing values in the data set. Additionally lets take a look at the summary statistics of the same. Additionally, lets convert the Date to a standardized format.
supermarket_clean <- supermarket[,-1]
supermarket_clean$Date <- as.Date(supermarket_clean$Date, "%m/%d/%y")
year(supermarket_clean$Date) <- 2019
sapply(supermarket_clean, function(x) sum(is.na(x)))## Branch City Customer.type
## 0 0 0
## Gender Product.line Unit.price
## 0 0 0
## Quantity Tax.5. Total
## 0 0 0
## Date Time Payment
## 0 0 0
## cogs gross.margin.percentage gross.income
## 0 0 0
## Rating
## 0
## Branch City Customer.type Gender
## A:340 Mandalay :332 Member:501 Female:501
## B:332 Naypyitaw:328 Normal:499 Male :499
## C:328 Yangon :340
##
##
##
##
## Product.line Unit.price Quantity Tax.5.
## Electronic accessories:170 Min. :10.08 Min. : 1.00 Min. : 0.5085
## Fashion accessories :178 1st Qu.:32.88 1st Qu.: 3.00 1st Qu.: 5.9249
## Food and beverages :174 Median :55.23 Median : 5.00 Median :12.0880
## Health and beauty :152 Mean :55.67 Mean : 5.51 Mean :15.3794
## Home and lifestyle :160 3rd Qu.:77.94 3rd Qu.: 8.00 3rd Qu.:22.4453
## Sports and travel :166 Max. :99.96 Max. :10.00 Max. :49.6500
##
## Total Date Time Payment
## Min. : 10.68 Min. :2019-01-01 14:42 : 7 Cash :344
## 1st Qu.: 124.42 1st Qu.:2019-01-24 19:48 : 7 Credit card:311
## Median : 253.85 Median :2019-02-13 17:38 : 6 Ewallet :345
## Mean : 322.97 Mean :2019-02-14 10:11 : 5
## 3rd Qu.: 471.35 3rd Qu.:2019-03-08 11:40 : 5
## Max. :1042.65 Max. :2019-03-30 11:51 : 5
## (Other):965
## cogs gross.margin.percentage gross.income Rating
## Min. : 10.17 Min. :4.762 Min. : 0.5085 Min. : 4.000
## 1st Qu.:118.50 1st Qu.:4.762 1st Qu.: 5.9249 1st Qu.: 5.500
## Median :241.76 Median :4.762 Median :12.0880 Median : 7.000
## Mean :307.59 Mean :4.762 Mean :15.3794 Mean : 6.973
## 3rd Qu.:448.90 3rd Qu.:4.762 3rd Qu.:22.4453 3rd Qu.: 8.500
## Max. :993.00 Max. :4.762 Max. :49.6500 Max. :10.000
##
Summary
supermarket_clean <- supermarket_clean %>%
mutate(revenue = Unit.price*Quantity) %>%
group_by(City)
supermarket_cleanplot_supermarket <- supermarket_clean %>%
ggplot(aes(x = City,
y = Total,
fill = Product.line)) +
facet_wrap(~Customer.type) +
theme_classic() +
geom_col(position = "dodge")+
scale_y_continuous(limits = c(0, 1200))
ggplotly(plot_supermarket)## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Dari bar plot diatas dapat terlihat pelanggan ber tipe Member dapat membeli satu produk tertinggi di kategori Fashion accsories pada kota Naypitaw dan terendah di kategori Health and beauty pada kota Yangon, dan pelanggan ber tipe Normal dapat membeli satu produk tertinggi di kategori Fashion accessories pada kota Yangon dan ter rendah di kategori Food and beverages pada kota Mandalay.
supermarket_total <- supermarket_clean %>%
select(City,Total) %>%
group_by(City) %>%
summarise(Total = sum(Total)) %>%
arrange(-Total)## `summarise()` ungrouping output (override with `.groups` argument)
plot_supermarket_total <- supermarket_total %>%
ggplot(aes(x = City,
y = Total,
fill = City)) +
theme_pander() +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 120000))
ggplotly(plot_supermarket_total)liatlah plot diatas, plot tersebut menunjukan kota Naypitaw lah yang memberikan total penjualan ter tinggi dan ter rendah pada kota Mandalay.
bxpBR <- supermarket_clean %>%
ggplot(data = supermarket_clean, mapping = aes(x = Branch, y = Rating)) +
geom_boxplot(notch = TRUE, mapping = aes(fill = Branch)) +
theme_classic() + ggtitle("Box Plot Showing Relationship between Branch and Ratings") +
xlab("Branch")+ ylab("Rating") +
geom_hline(mapping = aes(yintercept = 7.1), linetype = "dashed") +
geom_hline(mapping = aes(yintercept = 6.7), linetype = "dashed") +
geom_text(mapping = aes(x = "A", y = 7.5, label = "Mandalay")) +
geom_text(mapping = aes(x = "B", y = 7.5, label = "Naypyitaw")) +
geom_text(mapping = aes(x = "C", y = 7.5, label = "Yangon")) +
scale_y_continuous(breaks = c(4,5.3,5.6,6.7,7.1,8.2,8.5,10))
ggplotly(bxpBR)Boxplot diatas menunjukan bahwa distribusi rating pada kota Mandalay dan Yangon memiliki distribusi yang sama, tersebar di Q1:5.6 - Q3:8.5 dan memiliki nilai tengah di 7.1, sedangkan kota Naypyitaw jauh lebih rendah di banding 2 kota yang lain, persebaran nya di Q1:5.3 - Q3:8.2 dan memiliki nilai tengah di 6.7.
dist_cust <- ggplot(data = supermarket_clean, mapping = aes(x = Branch)) +
geom_bar(mapping = aes(fill = Customer.type)) + theme_linedraw() +
ggtitle("Distribution of Customers in Each Branch Based on Gender") +
xlab("Branch") + ylab("Number of Customers") + facet_wrap(supermarket_clean$Gender) +
scale_y_continuous(breaks = c(0, 25, 50, 75, 100, 125, 150, 175))
ggplotly(dist_cust)Dari plot diatas kita dapat mengetahui jumlah persebaran jenis pelanggan berdasarkan jenis kelamin, dapat terlihat dari ke 3 Branch bahwa Wanita lebih banyak menjadi pelanggan Member ketimbang Pria.
ggplot(data = supermarket_clean, aes(x = Quantity,
y = Customer.type,
fill = Payment
)) +
ggtitle("A")+
theme_pander()+
geom_col(position = "stack") +
scale_x_continuous(limits = c(0, 2500))## Warning: Removed 94 rows containing missing values (geom_col).
Dapat terlihat dari plot diatas bahwa tipe pelanggan Member, cinderung membeli produk menggunakan Credit card sedangkan tipe pelanggan Normal lebih cinderung menggunakan Ewallet dan Cash.
cust_type_payment <- supermarket_clean %>%
ggplot(data = supermarket_clean, mapping = aes(x = Branch)) +
geom_bar(mapping = aes(fill = Customer.type)) +
theme_linedraw() + ggtitle("Distribution of Customers in Each Branch Based on Payment Mode") +
xlab("Branch") + ylab("Number of Customers")
cust_type_payment + facet_wrap(supermarket_clean$Payment) +
scale_y_continuous(breaks = c(0, 25, 50, 75, 100, 125, 150, 175)) +
labs(caption = "Cash(Members) - 56/53/59 \n CreditCard(Members) - 49/63/60 \n Ewallet(Members) - 62/49/50")plot diatas menunjukan jumlah distribusi tipe pelanggan dari setiap Branch.
#create new theme
theme_a <- theme(legend.key = element_rect(fill="black"),
legend.background = element_rect(color="white", fill="#263238"),
plot.subtitle = element_text(size=6, color="white"),
panel.background = element_rect(fill="#dddddd"),
panel.border = element_rect(fill=NA),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color="darkgrey", linetype=2),
panel.grid.minor.y = element_blank(),
plot.background = element_rect(fill="#263238"),
text = element_text(color="white"),
axis.text = element_text(color="white")
) Jan_sales <- supermarket_clean %>% mutate(month = month(Date)) %>% filter(month == 1)
salesplot_per_month1 <- ggplot(data = Jan_sales, mapping = aes(x = Branch, y = Total)) +
geom_boxplot(notch = TRUE, mapping = aes(fill = Branch), outlier.color = "red", show.legend = FALSE) +
theme_linedraw() + ggtitle("January Sales") +
xlab("Branches") + ylab("Total Sales")
Feb_sales <- supermarket_clean %>% mutate(month = month(Date)) %>% filter(month == 2)
salesplot_per_month2 <- ggplot(data = Feb_sales, mapping = aes(x = Branch, y = Total)) +
geom_boxplot(notch = TRUE, mapping = aes(fill = Branch), outlier.color = "red", show.legend = FALSE) +
theme_linedraw() + ggtitle("February Sales") +
xlab("Branches") + ylab("Total Sales")
Mar_sales <- supermarket_clean %>% mutate(month = month(Date)) %>% filter(month == 3)
salesplot_per_month3 <- ggplot(data = Mar_sales, mapping = aes(x = Branch, y = Total)) +
geom_boxplot(notch = TRUE, mapping = aes(fill = Branch), outlier.color = "red") +
theme_linedraw() + ggtitle("March Sales") +
xlab("Branches") + ylab("Total Sales")
grid.arrange(salesplot_per_month1,salesplot_per_month2,salesplot_per_month3)plot diatas menjelaskan sebaran jumlah penjualan pada bulan januari sampei maret di tiga cabang berbeda, dapat dilihat bahwa penjualan Branch C (Naypyitaw) cinderung stabil dan mengalami sedikit kenaikan pada bulan maret, berbeda dengan Branch A (Yangon) yang mengalami penurunan tiap bulan nya begitu pula dengan Branch B (Mandalay) .
bxp1 <- ggplot(data = Jan_sales, mapping = aes(x = Product.line, y = Total)) +
geom_boxplot(mapping = aes(fill = Product.line), outlier.color = "red", show.legend = FALSE) +
scale_x_discrete(labels = c("Elc", "Fsh", "Fod", "Hel", "Hme", "Spr")) +
theme_linedraw() + ggtitle("January Sales Based on Product Lines") +
xlab("Product Lines") + ylab("Total Sales")
bxp2 <- ggplot(data = Feb_sales, mapping = aes(x = Product.line, y = Total)) +
geom_boxplot(mapping = aes(fill = Product.line), outlier.color = "red", show.legend = FALSE) +
scale_x_discrete(labels = c("Elc", "Fsh", "Fod", "Hel", "Hme", "Spr")) +
theme_linedraw() + ggtitle("February Sales Based on Product Lines") +
xlab("Product Lines") + ylab("Total Sales")
bxp3 <- ggplot(data = Mar_sales, mapping = aes(x = Product.line, y = Total))+
geom_boxplot(mapping = aes(fill = Product.line), outlier.color = "red") +
scale_x_discrete(labels = c("Elc", "Fsh", "Fod", "Hel", "Hme", "Spr")) +
theme_linedraw() + ggtitle("March Sales Based on Product Lines") +
xlab("Product Lines") + ylab("Total Sales")
grid.arrange(bxp1,bxp2,bxp3)plot diatas menunjukan Jumlah penjualan tiap produk pada tiap bulan, dapat disimpulkan bahwa penjualan tertinggi pada kategori produk Food and beverages dikarnakan kemungkinan pada bulan maret terdapat banyak outlier Total di sekitar 750 - 900, lalu disusul kedua di kategori Sports and travel dilihat dari distribusi box nya konstan di 2 bulan pertama dan naik pada bulan terakhir.
supermarket_total_product <- supermarket_clean %>%
select(Product.line, Total) %>%
group_by(Product.line) %>%
summarise(Total_Sales = sum(Total)) %>%
arrange(-Total_Sales)## Adding missing grouping variables: `City`
## `summarise()` ungrouping output (override with `.groups` argument)
supermarket_total_Q <- supermarket_clean %>%
select(Product.line, Quantity) %>%
group_by(Product.line) %>%
summarise(Total_Quantity = sum(Quantity)) %>%
arrange(-Total_Quantity)## Adding missing grouping variables: `City`
## `summarise()` ungrouping output (override with `.groups` argument)
Tabel diatas berisikan jumlah keseluruhan dari tiap kategori product yang terjual di tiga cabang.
plot_supermarket_totalQ <- supermarket_total_Q %>%
ggplot(aes(x = Product.line,
y = Total_Quantity,
fill = Product.line)) +
theme_a +
geom_col(position = "dodge")
# scale_y_continuous(limits = c(0, 1200))
ggplotly(plot_supermarket_totalQ)Terlihat diatas bahwa dari ke tiga Branch produk yang paling banyak di jual merupakan produk berkategori Electonic accessories, dapat disimpulkan bahwa tidak berarti bahwa Quantity suatu produk banyak terjual akan mengasilkan Total sales yang besar juga.
supermarket_Naypyitaw <- supermarket_clean[supermarket_clean$City == "Naypyitaw",]
supermarket_Naypyitawsupermarket_total_product_Naypyitaw <- supermarket_Naypyitaw %>%
select(Product.line, Quantity) %>%
group_by(Product.line) %>%
summarise(Total_Quantity = sum(Quantity)) %>%
arrange(-Total_Quantity)## Adding missing grouping variables: `City`
## `summarise()` ungrouping output (override with `.groups` argument)
plot_supermarket_totalQC <- supermarket_total_product_Naypyitaw %>%
ggplot(aes(x = Product.line,
y = Total_Quantity,
fill = Product.line)) +
theme_a +
geom_col(position = "dodge")
# scale_y_continuous(limits = c(0, 1200))
ggplotly(plot_supermarket_totalQC)supermarket_total_product_Yangon <- supermarket_Yangon %>%
select(Product.line, Quantity) %>%
group_by(Product.line) %>%
summarise(Total_Quantity = sum(Quantity)) %>%
arrange(-Total_Quantity)## Adding missing grouping variables: `City`
## `summarise()` ungrouping output (override with `.groups` argument)
plot_supermarket_totalQA <- supermarket_total_product_Yangon %>%
ggplot(aes(x = Product.line,
y = Total_Quantity,
fill = Product.line)) +
theme_a +
geom_col(position = "dodge")
# scale_y_continuous(limits = c(0, 1200))
ggplotly(plot_supermarket_totalQA)supermarket_Mandalay <- supermarket_clean[supermarket_clean$City == "Mandalay",]
supermarket_Mandalaysupermarket_total_product_Mandalay <- supermarket_Mandalay %>%
select(Product.line, Quantity) %>%
group_by(Product.line) %>%
summarise(Total_Quantity = sum(Quantity)) %>%
arrange(-Total_Quantity)## Adding missing grouping variables: `City`
## `summarise()` ungrouping output (override with `.groups` argument)
total_revenue_per_day <- data.frame(xtabs(formula=revenue~Date, data = supermarket_clean))
total_revenue_per_day$Date <- as.Date(total_revenue_per_day$Date)
plotR <- total_revenue_per_day %>%
ggplot(data = total_revenue_per_day, mapping = aes(x = Date, y = Freq)) +
geom_line() +
geom_point() +
theme_pander() +
ggtitle("Time Series Relationship for the Total Revenue per day") +
xlab("Date")+ ylab("Total revenue Per Day")
ggplotly(plotR)Dari terlihat dari plot diatas bahwa setiap tanggal 4, 18 dan 13 selalu mengalami penurunan pendapatan dan terparah di tanggal 13 bulan Februari.
A <- supermarket_clean %>% filter(Branch == "A")
revenue_A <- data.frame(xtabs(formula = revenue~Date, data = A))
revenue_A$Date <- as.Date(revenue_A$Date)
B <- supermarket_clean %>% filter(Branch == "B")
revenue_B <- data.frame(xtabs(formula = revenue~Date, data = B))
revenue_B$Date <- as.Date(revenue_B$Date)
C <- supermarket_clean %>% filter(Branch == "C")
revenue_C <- data.frame(xtabs(formula = revenue~Date, data = C))
revenue_C$Date <- as.Date(revenue_C$Date)
plotA <- ggplot(data = revenue_A, mapping = aes(x = Date, y = Freq)) +
geom_line() + theme_linedraw() +
ggtitle("Time Series Relationship for the Total Revenue per day in Branch A") +
xlab("Date")+ ylab("Total Revenue Per Day")
plotB <- ggplot(data = revenue_B, mapping = aes(x = Date, y = Freq)) +
geom_line() + theme_linedraw() +
ggtitle("Time Series Relationship for the Total Revenue per day in Branch B") +
xlab("Date")+ ylab("Total Revenue Per Day")
plotC <- ggplot(data = revenue_C, mapping = aes(x = Date, y = Freq)) +
geom_line() + theme_linedraw() +
ggtitle("Time Series Relationship for the Total Revenue per day in Branch C") +
xlab("Date")+ ylab("Total Revenue Per Day")
grid.arrange(plotA, plotB, plotC)