Find an interesting dataset that contains time series and prepare a short report (in R Markdown) consisting of: - a short description of the dataset, - 4 line plots that present interesting relationships between variables, - brief comments that describe the obtained results.
Then edit the theme and all scales of the graphs to prepare publication-ready plots.
Supermarkets are growing in the most populated cities, and market competition is high. This dataset contains the historical sales of a supermarket company, recorded in 3 different branches over 3 months. Predictive data analytics methods are easy to apply to this dataset.
Sys.setenv(LANG = "en")
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(readr)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
supermarket_sale <- read.csv('/Users/jeank4723/Desktop/Advance VR/1/Data/supermarket_sale.csv')
head(supermarket_sale)
## Invoice.ID Branch City Customer.type Gender Product.line
## 1 750-67-8428 A Yangon Member Female Health and beauty
## 2 226-31-3081 C Naypyitaw Normal Female Electronic accessories
## 3 631-41-3108 A Yangon Normal Male Home and lifestyle
## 4 123-19-1176 A Yangon Member Male Health and beauty
## 5 373-73-7910 A Yangon Normal Male Sports and travel
## 6 699-14-3026 C Naypyitaw Normal Male Electronic accessories
## Unit.price Quantity Tax.5. Total Date Time Payment cogs
## 1 74.69 7 26.1415 548.9715 1/5/2019 13:08 Ewallet 522.83
## 2 15.28 5 3.8200 80.2200 3/8/2019 10:29 Cash 76.40
## 3 46.33 7 16.2155 340.5255 3/3/2019 13:23 Credit card 324.31
## 4 58.22 8 23.2880 489.0480 1/27/2019 20:33 Ewallet 465.76
## 5 86.31 7 30.2085 634.3785 2/8/2019 10:37 Ewallet 604.17
## 6 85.39 7 29.8865 627.6165 3/25/2019 18:30 Ewallet 597.73
## gross.margin.percentage gross.income Rating
## 1 4.761905 26.1415 9.1
## 2 4.761905 3.8200 9.6
## 3 4.761905 16.2155 7.4
## 4 4.761905 23.2880 8.4
## 5 4.761905 30.2085 5.3
## 6 4.761905 29.8865 4.1
supermarket_sale <- na.omit(supermarket_sale)
supermarket_sale$Date <- as.Date(supermarket_sale$Date, format = "%m/%d/%Y")
str(supermarket_sale)
## 'data.frame': 1000 obs. of 17 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ City : chr "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
## $ Customer.type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product.line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax.5. : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
## $ Date : Date, format: "2019-01-05" "2019-03-08" ...
## $ Time : chr "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
colnames(supermarket_sale)
## [1] "Invoice.ID" "Branch"
## [3] "City" "Customer.type"
## [5] "Gender" "Product.line"
## [7] "Unit.price" "Quantity"
## [9] "Tax.5." "Total"
## [11] "Date" "Time"
## [13] "Payment" "cogs"
## [15] "gross.margin.percentage" "gross.income"
## [17] "Rating"
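The date conversion above depends on the month-first format string `"%m/%d/%Y"`. As a small check, the sketch below parses three date strings copied from the `head()` output and compares the result with a day-first format. The names `raw_dates`, `parsed` and `day_first` are introduced only for this illustration.

```r
# Three date strings as they appear in the raw Date column
raw_dates <- c("1/5/2019", "3/8/2019", "1/27/2019")

# Month-first parsing, as used in the report
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
format(parsed, "%Y-%m-%d")

# A day-first format silently swaps day and month where it can,
# and returns NA where the "month" would be 27
day_first <- as.Date(raw_dates, format = "%d/%m/%Y")
day_first
```

Because a wrong format string can fail silently, checking `sum(is.na(supermarket_sale$Date))` after the conversion is a cheap safeguard.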
This plot shows the daily gross income over time, so we can see how sales fluctuate in this supermarket.
# Sum the gross income of all invoices for each day
gross_income_date <- supermarket_sale %>%
  group_by(Date) %>%
  summarise(gross.income = sum(gross.income))

p1 <- ggplot(data = gross_income_date, aes(x = Date, y = gross.income))
p1 +
  geom_point(color = "orange") +
  geom_line(color = "skyblue", size = 0.5) +
  theme_minimal() +
  labs(title = "Daily Gross Income", x = "Date", y = "Gross Income") +
  scale_x_date(labels = date_format("%Y-%m"))
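Since the task asks for publication-ready plots, the daily series could be restyled along the following lines. This is only a sketch: the `toy_income` values are made up for illustration, and the theme settings (base font size, monthly breaks, comma labels, rotated tick labels) are assumptions about what "publication-ready" means here, not part of the original report.

```r
library(ggplot2)
library(scales)

# Made-up daily values standing in for gross_income_date
toy_income <- data.frame(
  Date = as.Date("2019-01-01") + 0:59,
  gross.income = 200 + 50 * sin(seq(0, 6, length.out = 60))
)

p_pub <- ggplot(toy_income, aes(x = Date, y = gross.income)) +
  geom_line(color = "skyblue") +
  geom_point(color = "orange", size = 1) +
  # One labelled break per month, e.g. "Jan 2019"
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  # Thousands separators on the y axis
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Daily Gross Income", x = "Date", y = "Gross Income") +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

p_pub
```

The same scale and theme layers could be saved in a list and added to each of the four plots so that they share one look.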
According to the plot, the average purchase total of male customers varies more from day to day than that of female customers. Male customers tend either to buy a lot or to buy very little.
# Average invoice total per day, separately for each gender
stat2 <- supermarket_sale %>%
  group_by(Date, Gender) %>%
  summarise(Total = mean(Total))
## `summarise()` has grouped output by 'Date'. You can override using the `.groups` argument.
p2 <- ggplot(data = stat2, aes(x = Date, y = Total, color = factor(Gender)))
p2 +
  geom_point() +
  geom_line() +
  theme_bw() +
  labs(title = "Average Total Price by Gender",
       x = "Date",
       y = "Average Total Price",
       colour = "Gender") +
  scale_x_date(labels = date_format("%Y-%m"))
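The message printed by `summarise()` above is informational: after grouping by two variables, the result stays grouped by `Date`. A minimal sketch of silencing it with the `.groups` argument that the message itself mentions, using a few made-up rows (`toy_sales`) rather than the real data:

```r
library(dplyr)

# Made-up rows for illustration only
toy_sales <- data.frame(
  Date   = as.Date(c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02")),
  Gender = c("Female", "Female", "Male", "Male"),
  Total  = c(100, 300, 150, 250)
)

toy_stat <- toy_sales %>%
  group_by(Date, Gender) %>%
  # .groups = "drop" returns an ungrouped result and suppresses the message
  summarise(Total = mean(Total), .groups = "drop")

toy_stat
```

Dropping the grouping does not change the plotted values. It only avoids carrying a leftover grouping into later steps.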
In the plot, the orange line represents customers with a membership and the blue line represents customers without one, both for the Branch A supermarket. The supermarket received more gross income from members than from non-members. We cannot clearly tell from this plot whether members shop more than non-members. However, we can see that membership can attract customers to come and purchase.
# Average gross income per day in Branch A, by customer type
stat3 <- supermarket_sale %>%
  filter(Branch == "A") %>%
  group_by(Date, Customer.type) %>%
  summarise(gross.income = mean(gross.income))
## `summarise()` has grouped output by 'Date'. You can override using the `.groups` argument.
p3 <- ggplot(data = stat3, aes(x = Date, y = gross.income, color = factor(Customer.type)))
p3 +
  geom_point() +
  geom_line() +  # geom_path() dropped: it redrew the same line on date-ordered data
  labs(title = "The Daily Gross Income Divided by Membership",
       x = "Date",
       y = "Gross Income",
       colour = "Customer Type") +
  scale_x_date(labels = date_format("%Y-%m")) +
  scale_color_manual(values = c("orange", "skyblue"))
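Because the plot uses `mean(gross.income)`, it shows the average income per invoice, not how often each group shops. One way to separate the two is to summarise the invoice count, the total and the mean side by side. The sketch below does this on the four Branch A rows visible in the `head()` output, which are far too few to draw conclusions from. The names `branch_a_sample` and `membership_summary` are introduced only for this illustration.

```r
library(dplyr)

# The four Branch A invoices shown in the head() output
branch_a_sample <- data.frame(
  Customer.type = c("Member", "Normal", "Member", "Normal"),
  gross.income  = c(26.1415, 16.2155, 23.2880, 30.2085)
)

membership_summary <- branch_a_sample %>%
  group_by(Customer.type) %>%
  summarise(
    invoices     = n(),                # how many purchases
    total_income = sum(gross.income),  # overall contribution
    mean_income  = mean(gross.income)  # average per purchase
  )

membership_summary
```

Running the same summary on `supermarket_sale %>% filter(Branch == "A")` would show whether members differ in how often they buy, how much they spend per purchase, or both.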
In this plot we can observe the daily average total price and the average customer rating. In February there was a day on which the total price reached its highest value. However, the customer rating did not follow the same pattern.
# Average invoice total and average rating per day
stat4 <- supermarket_sale %>%
  group_by(Date) %>%
  summarise(Total = mean(Total), Rating = mean(Rating))
p4 <- ggplot(data = stat4, aes(x = Date))
p4 +
  geom_line(aes(y = Total), color = "steelblue") +  # I series
  geom_line(aes(y = Rating * 100), color = "red") + # II series (scaled up by 100)
  theme_grey() +
  labs(title = "The Daily Total Price and Rating by Customers",
       x = "Date",
       y = "Total Price") +
  scale_y_continuous(sec.axis = sec_axis(~ . / 100, name = "Rating 1-10")) +
  theme(legend.position = "top")
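The second axis works by multiplying `Rating` by 100 before plotting and dividing the axis labels by 100 again in `sec_axis()`. The sketch below shows the same idea with the factor derived from the data instead of hard-coded, and with the colours mapped inside `aes()` so that a legend exists for `legend.position` to place. The `toy_daily` values are made up, and `scale_factor` is a name introduced here.

```r
library(ggplot2)

# Made-up daily averages for illustration only
toy_daily <- data.frame(
  Date   = as.Date("2019-01-01") + 0:4,
  Total  = c(300, 350, 280, 400, 320),
  Rating = c(7.1, 6.8, 7.4, 6.5, 7.0)
)

# Stretch Rating so that its maximum matches the maximum of Total
scale_factor <- max(toy_daily$Total) / max(toy_daily$Rating)

p_dual <- ggplot(toy_daily, aes(x = Date)) +
  geom_line(aes(y = Total, color = "Total price")) +
  geom_line(aes(y = Rating * scale_factor, color = "Rating")) +
  scale_y_continuous(
    name = "Total Price",
    # Undo the stretch so the right-hand axis reads in rating units
    sec.axis = sec_axis(~ . / scale_factor, name = "Rating")
  ) +
  scale_color_manual(
    values = c("Total price" = "steelblue", "Rating" = "red"),
    name = NULL
  ) +
  theme(legend.position = "top")

p_dual
```

Both lines share one set of pixels, so any apparent crossing or co-movement depends on the chosen factor and should be read with care.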