
Find an interesting dataset that contains a time series and prepare a short report (in R Markdown) consisting of: a short description of the dataset, 4 line plots presenting interesting relationships between variables, and brief comments describing the results.

Then edit the theme and all scales of the graphs to prepare publication-ready plots.

Introduction

Supermarkets are growing rapidly in the most populated cities, and market competition is high. This dataset contains the historical sales of a supermarket company, recorded at 3 different branches over 3 months. Predictive data analytics methods are easy to apply to this dataset.

Source: https://www.kaggle.com/aungpyaeap/supermarket-sales

Data Variables

  1. Invoice id | Computer generated sales slip invoice identification number
  2. Branch | Branch of supercenter (A/B/C).
  3. City | Location of supercenters
  4. Customer type | Type of customer: Member (with a member card) or Normal (without one)
  5. Gender | Gender type of customer
  6. Product line | General item categorization groups: Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
  7. Unit price | Price of each product in $
  8. Quantity | Number of products purchased by customer
  9. Tax | 5% tax charged on the customer's purchase
  10. Total | Total price including tax
  11. Date | Date of purchase (MM/DD/YYYY)
  12. Time | Purchase time (10am to 9pm)
  13. Payment | Payment used by customer for purchase (Cash/Credit card/Ewallet)
  14. COGS | Cost of goods sold
  15. Gross margin percentage | Gross margin percentage
  16. Gross income | Gross income
  17. Rating | Customer satisfaction rating of the overall shopping experience (on a scale of 1 to 10)
Sys.setenv(LANG = "en")

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(readr)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
supermarket_sale <- read.csv('/Users/jeank4723/Desktop/Advance VR/1/Data/supermarket_sale.csv')

head(supermarket_sale)
##    Invoice.ID Branch      City Customer.type Gender           Product.line
## 1 750-67-8428      A    Yangon        Member Female      Health and beauty
## 2 226-31-3081      C Naypyitaw        Normal Female Electronic accessories
## 3 631-41-3108      A    Yangon        Normal   Male     Home and lifestyle
## 4 123-19-1176      A    Yangon        Member   Male      Health and beauty
## 5 373-73-7910      A    Yangon        Normal   Male      Sports and travel
## 6 699-14-3026      C Naypyitaw        Normal   Male Electronic accessories
##   Unit.price Quantity  Tax.5.    Total      Date  Time     Payment   cogs
## 1      74.69        7 26.1415 548.9715  1/5/2019 13:08     Ewallet 522.83
## 2      15.28        5  3.8200  80.2200  3/8/2019 10:29        Cash  76.40
## 3      46.33        7 16.2155 340.5255  3/3/2019 13:23 Credit card 324.31
## 4      58.22        8 23.2880 489.0480 1/27/2019 20:33     Ewallet 465.76
## 5      86.31        7 30.2085 634.3785  2/8/2019 10:37     Ewallet 604.17
## 6      85.39        7 29.8865 627.6165 3/25/2019 18:30     Ewallet 597.73
##   gross.margin.percentage gross.income Rating
## 1                4.761905      26.1415    9.1
## 2                4.761905       3.8200    9.6
## 3                4.761905      16.2155    7.4
## 4                4.761905      23.2880    8.4
## 5                4.761905      30.2085    5.3
## 6                4.761905      29.8865    4.1
supermarket_sale <- na.omit(supermarket_sale)



supermarket_sale$Date <- as.Date(supermarket_sale$Date, format = "%m/%d/%Y")
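
The `format` string has to match the month/day/year layout of the raw `Date` column; a mismatched format can misread a date or return `NA` without raising an error. A minimal sketch using two date strings from the `head()` output above; the variable names `raw`, `ok`, and `wrong` are illustrative only.

```r
# Two raw date strings as they appear in the CSV (month/day/year).
raw <- c("1/5/2019", "1/27/2019")

# Matching format: both strings parse.
ok <- as.Date(raw, format = "%m/%d/%Y")
print(ok)

# Day-first format: "1/5/2019" is misread as 1 May, "1/27/2019" becomes NA.
wrong <- as.Date(raw, format = "%d/%m/%Y")
print(wrong)

# A quick guard that could follow the conversion in the report.
stopifnot(!anyNA(ok))
```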

str(supermarket_sale)
## 'data.frame':    1000 obs. of  17 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "A" "C" "A" "A" ...
##  $ City                   : chr  "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
##  $ Customer.type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax.5.                 : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...
##  $ Date                   : Date, format: "2019-01-05" "2019-03-08" ...
##  $ Time                   : chr  "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
colnames(supermarket_sale)
##  [1] "Invoice.ID"              "Branch"                 
##  [3] "City"                    "Customer.type"          
##  [5] "Gender"                  "Product.line"           
##  [7] "Unit.price"              "Quantity"               
##  [9] "Tax.5."                  "Total"                  
## [11] "Date"                    "Time"                   
## [13] "Payment"                 "cogs"                   
## [15] "gross.margin.percentage" "gross.income"           
## [17] "Rating"
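
The monetary columns are linked by simple arithmetic, which is why `gross.margin.percentage` is the constant 4.761905 in every row. A short check using the first invoice from the `head()` output above (unit price 74.69, quantity 7); the 5% rate comes from the variable description, and the variable names are illustrative.

```r
cogs <- 74.69 * 7                   # Unit.price * Quantity = 522.83 (cogs column)
tax <- 0.05 * cogs                  # Tax.5. column
total <- cogs + tax                 # Total column
gross_income <- total - cogs        # gross.income column, equal to the tax
margin_pct <- 100 * gross_income / total   # always 100 * 0.05 / 1.05

print(c(cogs = cogs, tax = tax, total = total, margin_pct = margin_pct))
```

Because the margin is the same for every invoice, `gross.income` is just a fixed fraction of `Total`, so plots of either variable have the same shape.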

4 lineplots

1. Date and Gross Income

The plot shows the daily gross income over time, so we can see how sales fluctuate in this supermarket.

gross_income_date <- supermarket_sale  %>% 
  group_by(Date) %>% 
  summarise(gross.income = sum(gross.income))


p1 <- ggplot(data = gross_income_date, aes(x = Date, y = gross.income))   

p1 + 
  geom_point(color = "orange") +
  geom_line( color = "skyblue", size = 0.5) +
  theme_minimal()  +
  labs(title = "Daily Gross Income", x = "Date", y = "Gross Income") +
  scale_x_date(labels = date_format("%Y-%m"))
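
For a publication-ready version of this plot, the date and currency scales can be made explicit. The sketch below is one possible styling, not the report's original code: it uses a small made-up data frame `toy` in place of `gross_income_date`, and the break spacing, label formats, and font size are assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for gross_income_date (values are made up).
toy <- data.frame(
  Date = seq(as.Date("2019-01-01"), as.Date("2019-03-30"), by = "day")
)
toy$gross.income <- 200 + 50 * sin(seq_along(toy$Date) / 5)

p <- ggplot(toy, aes(x = Date, y = gross.income)) +
  geom_line(color = "skyblue") +
  geom_point(color = "orange", size = 1) +
  # One labelled break per month, e.g. "Jan 2019".
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  # Dollar-formatted axis labels from the scales package.
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(title = "Daily Gross Income", x = "Date", y = "Gross Income") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"),
        panel.grid.minor = element_blank())

print(p)
```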

2. Total and Gender

The plot suggests that the average daily spend of male customers varies more than that of female customers. Male customers tend either to buy a lot or to buy very little.

stat2 <- supermarket_sale %>% 
  group_by(Date, Gender) %>% 
  summarise(Total = mean(Total))
## `summarise()` has grouped output by 'Date'. You can override using the `.groups` argument.
p2 <- ggplot(data = stat2, aes(x = Date, y = Total, color = factor(Gender)))
   
p2 +
  geom_point() +
  geom_line() +
  theme_bw()+
  labs(title = "Average Total Price by Gender", 
       x = "Date", 
       y = "Total Price", 
       colour = "Gender") +
  scale_x_date(labels = date_format("%Y-%m"))
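
Whether one group fluctuates more than the other can be checked numerically rather than by eye. A minimal sketch on a made-up data frame `toy` shaped like `stat2`; with the real data, `stat2` would take its place. Passing `.groups = "drop"` also silences the grouping message printed by `summarise()`.

```r
library(dplyr)

# Hypothetical daily averages per gender (values are made up).
toy <- data.frame(
  Date   = rep(as.Date("2019-01-01") + 0:3, each = 2),
  Gender = rep(c("Female", "Male"), times = 4),
  Total  = c(300, 100, 320, 600, 310, 150, 330, 550)
)

# Mean and standard deviation of the daily averages, per gender.
spread <- toy %>%
  group_by(Gender) %>%
  summarise(mean_total = mean(Total),
            sd_total   = sd(Total),
            .groups    = "drop")

print(spread)
```

A larger `sd_total` for one group would support the reading that its spending is more volatile.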

3. Gross Income and Customer Type

In the plot, the orange line represents customers with a membership and the blue line represents customers without one, both at the branch A supermarket. The supermarket received more gross income from members than from non-members. We cannot clearly tell whether members shop more often than non-members, but the plot suggests that membership attracts customers to make purchases.

stat3 <- supermarket_sale %>%
  filter(Branch == "A") %>% 
  group_by(Date, Customer.type) %>% 
  summarise(gross.income = mean(gross.income))
## `summarise()` has grouped output by 'Date'. You can override using the `.groups` argument.
p3 <- ggplot(data = stat3, aes(x = Date, y = gross.income, color = factor(Customer.type)))
   
p3 +
  geom_point() +
  geom_line() +
  labs(title = "The Daily Gross Income Divided by Membership", 
       x = "Date", 
       y = "Gross Income", 
       colour = "Customer Type") +
  scale_x_date(labels = date_format("%Y-%m")) +
  scale_color_manual(values=c('Orange','skyblue'))
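
Because `stat3` uses `mean(gross.income)`, the plot shows the average income per transaction, not the total earned from each group. A sketch, again on made-up data, of computing both so the two readings can be compared; the column names `total_income`, `mean_income`, and `n_sales` are illustrative.

```r
library(dplyr)

# Hypothetical transactions (values are made up).
toy <- data.frame(
  Date          = as.Date("2019-01-01") + c(0, 0, 0, 1, 1),
  Customer.type = c("Member", "Member", "Normal", "Member", "Normal"),
  gross.income  = c(10, 20, 25, 30, 5)
)

by_type <- toy %>%
  group_by(Date, Customer.type) %>%
  summarise(total_income = sum(gross.income),   # income earned that day
            mean_income  = mean(gross.income),  # income per transaction
            n_sales      = n(),                 # number of transactions
            .groups      = "drop")

print(by_type)
```

On the first toy day, members bring in more in total (30 versus 25) even though their average per sale is lower (15 versus 25), so the choice of summary changes the conclusion.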

4. Average Total and Average Rating

The plot shows the daily average total price together with the daily average customer rating. One day in February reached the highest average total price. However, the customer rating did not follow the same pattern.

stat4 <- supermarket_sale  %>% 
  group_by(Date) %>% 
  summarise(Total = mean(Total), Rating = mean(Rating))

p4 <- ggplot(data = stat4, aes(x = Date)) 

p4 +
  geom_line(aes(y = Total), color = "steelblue") + # I series
  geom_line(aes(y = Rating*100), color = "red") + # II series (scaled-up)
  theme_grey() +
  labs(title = "The Daily Total Price and Rating by Customers", 
       x = "Date", 
       y = "Total Price") +
  scale_y_continuous(sec.axis = sec_axis(~./100, name = 'Rating 1-10'))+
  theme(legend.position = 'top')
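
The factor of 100 rescales `Rating` so that both series share one panel, and `sec_axis()` applies the inverse so the right-hand axis reads in rating units. The sketch below, on made-up data, derives the factor from the data instead of hard-coding it and maps colour inside `aes()` so that a legend is drawn; `k` and the series labels are illustrative choices.

```r
library(ggplot2)

# Hypothetical stand-in for stat4 (values are made up).
toy <- data.frame(
  Date   = as.Date("2019-01-01") + 0:9,
  Total  = c(250, 310, 280, 400, 330, 290, 360, 300, 270, 340),
  Rating = c(7.1, 6.8, 7.4, 6.5, 7.0, 7.2, 6.9, 7.3, 6.6, 7.0)
)

# Scale factor that puts Rating on roughly the same range as Total.
k <- max(toy$Total) / max(toy$Rating)

p <- ggplot(toy, aes(x = Date)) +
  geom_line(aes(y = Total, colour = "Total price")) +
  geom_line(aes(y = Rating * k, colour = "Rating")) +      # scaled up by k
  scale_y_continuous(
    name = "Total Price",
    sec.axis = sec_axis(~ . / k, name = "Rating (1-10)")   # inverse of k
  ) +
  scale_colour_manual(values = c("Total price" = "steelblue",
                                 "Rating" = "red")) +
  labs(x = "Date", colour = NULL) +
  theme(legend.position = "top")

print(p)
```

With colour mapped in `aes()`, `theme(legend.position = "top")` has a legend to position; with fixed colours set outside `aes()` no legend is created.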