Group Members
Student ID | Student Name |
---|---|
17217922 | Farah Husna Binti Ramli (Leader) |
S2001169 | Nabilah Ahmad Kamal |
17120332 | Nor Azyra Binti Omar |
Definition: Exploratory Data Analysis refers to the critical process of performing initial investigations
on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.
Objective: To test a hypothesis or check assumptions related to different types of problems for business analytics using the supermarket sales data.
Problems/Questions:
1. Descriptive - What is the top payment method based on the product lines?
2. Prediction - When is the best time to offer discounts?
3. Correlation - Is there a relationship between total sales on weekdays vs weekend?
About the raw data
Title: The Supermarket Sales Data
Year: 2019
Source: Kaggle: https://www.kaggle.com/aungpyaeap/supermarket-sales
About: The growth of supermarkets in most populated cities are increasing and market competitions are also high. This dataset is one of the historical sales of supermarket company which has recorded in 3 different branches for 3 months data.
Attributes:
Tax
: 5% tax fee for customer buying Total
: Total price including tax Date
: Date of purchase (Record available from January 2019 to March 2019) Time
: Purchase time (10am to 9pm) Payment
: Payment used by customer for purchase (Cash, Credit card and Ewallet)COGS
: Cost of goods sold Gross margin percentage
: Gross margin percentage Gross income
: Gross income Rating
: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10) #Load the dataset
supermarket_sales <- read.csv("supermarket_sales.csv")
Check the dimension of the database.
suppressWarnings(library(dplyr))
dim(supermarket_sales)
## [1] 1000 17
Preview some of the data to get an overview.
head(supermarket_sales)
Use descriptive analysis to understand the data.
str(supermarket_sales)
## 'data.frame': 1000 obs. of 17 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ City : chr "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
## $ Customer.type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product.line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax.5. : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
#to show min & max unit price for each of product line
supermarket_sales %>%
group_by(supermarket_sales$Product.line) %>%
summarise(
total_inv = n(),
min_price = min(Unit.price),
max_price = max(Unit.price),
avg_price = mean(Unit.price),
)
Is a significant step to process and prepare the data for analysis.
1- Confirm the class of the loaded dataset is a data frame.
class(supermarket_sales)
## [1] "data.frame"
2- Check for duplication sales transactions based on Invoice_ID
.
sum(duplicated(supermarket_sales[,1]))
## [1] 0
3- Check for missing values across all columns in the data frame.
sapply(supermarket_sales,function(x) table(as.character(x) =="")["TRUE"])
## Invoice.ID.NA Branch.NA
## NA NA
## City.NA Customer.type.NA
## NA NA
## Gender.NA Product.line.NA
## NA NA
## Unit.price.NA Quantity.NA
## NA NA
## Tax.5..NA Total.NA
## NA NA
## Date.NA Time.NA
## NA NA
## Payment.NA cogs.NA
## NA NA
## gross.margin.percentage.NA gross.income.NA
## NA NA
## Rating.NA
## NA
sapply(supermarket_sales,function(x) sum(is.na(x)))
## Invoice.ID Branch City
## 0 0 0
## Customer.type Gender Product.line
## 0 0 0
## Unit.price Quantity Tax.5.
## 0 0 0
## Total Date Time
## 0 0 0
## Payment cogs gross.margin.percentage
## 0 0 0
## gross.income Rating
## 0 0
4- Convert date from character to datetime.
typeof(supermarket_sales$Date) #check type of object for date
## [1] "character"
supermarket_sales$Date<-as.Date(supermarket_sales$Date, format = "%m/%d/%Y")
5- Remove irrelevant information- only keep columns significant to sales analysis for the business.
supermarket_sales<-subset(supermarket_sales,
select=c("Branch","City","Customer.type","Gender",
"Product.line","Quantity","Total","Date","Time","Payment"))
1- Create new columns for month and day.
supermarket_sales<-transform(supermarket_sales,Months= months(supermarket_sales$Date))
supermarket_sales<-transform(supermarket_sales,Days= weekdays(supermarket_sales$Date))
2- Re-arrange the columns & assigned it to a new processed dataset.
SS_df_cleaned<-supermarket_sales[,c(1:8,11:12,9:10)]
3- Check the dimension of the transformed database.
dim(SS_df_cleaned) # by rows & columns
## [1] 1000 12
4- Get the details of the transformed database to confirm the changes done.
library(funModeling) #for analysis
status(SS_df_cleaned)
1- Load required library for analysis and visualization.
library(ggplot2)
library(scales)
library(RColorBrewer)
2- Analyzing categorical variables:
freq(SS_df_cleaned$Branch)
freq(SS_df_cleaned$City)
freq(SS_df_cleaned$Customer.type)
freq(SS_df_cleaned$Product.line)
freq(SS_df_cleaned$Payment)
3- Analyzing numerical variables:
profiling_num(SS_df_cleaned)
plot_num(SS_df_cleaned)
4- Analyzing categorical & numerical variables:
#Analysis by Customer type-Normal
SS_df_analysed<-subset(SS_df_cleaned,SS_df_cleaned$Customer.type=="Normal")
normal<-ggplot(data = SS_df_analysed, aes(x=Product.line,y=Total, color=Product.line)) +
geom_boxplot()+
scale_color_brewer(palette="Dark2") +
geom_jitter(shape=16, position=position_jitter(0.2))+
theme(axis.text.x = element_text(angle = 25))+
labs(title = 'Which product lines do Non-members spend the most',
y='Total Sales',x='Product Lines')
normal
#Analysis by Customer type-Member
SS_df_analysed<-subset(SS_df_cleaned,SS_df_cleaned$Customer.type=="Member")
member<-ggplot(data = SS_df_analysed, aes(x=Product.line,y=Total, color=Product.line)) +
geom_boxplot()+
scale_color_brewer(palette="Dark2") +
geom_jitter(shape=16, position=position_jitter(0.2))+
theme(axis.text.x = element_text(angle = 25))+
labs(title = 'Which product lines do Members spend the most',
y='Total Sales',x='Product Lines')
member
The distribution of sales across categories for customers with membership is more equally spread out compare to those of normal customers.
Non-members maximum spending across categories are mostly less than or close to $500 only while customers with membership spend the most in health & beauty line.
There are big differences in the distributions of sales between non-members and members in fashion accessories, health & beauty and sports & travel product lines.
Answer different types of business questions for analysis
Plotting the bar graph for the frequency of each payment methods.
# count Payment and change the count values into percentage
plotdata <- SS_df_cleaned %>% count(Payment) %>%
mutate(pct = n / sum(n), # count / sum of count
pctlabel = paste0(round(pct*100), "%")) #percentage form
# plot the bars for Payment in ascending order
ggplot(plotdata,
aes(x = reorder(Payment, pct),
y = pct)) +
geom_bar(stat = "identity",
fill = "#6fd1d1",
color = "black") +
geom_text(aes(label = pctlabel),
vjust=-0.25) +
scale_y_continuous(labels = percent) +
theme_minimal() +
labs(x = "Payment",
y = "Frequency",
title = "Payment Mehods")
Plotting the bar graph for the frequency of each payment methods according to the product lines.
# plot the bars for Payment based on the colours of Product.line
plotdata <- SS_df_cleaned %>% count(Payment, Product.line) %>%
mutate(pct = n / sum(n),
pctlabel = paste0(round(pct*100), "%"))
ggplot(plotdata,
aes(x = Product.line,
y = pct,
fill = Payment)) +
geom_bar(stat = "identity",
position=position_dodge()) +
scale_fill_brewer(palette="Dark2") +
theme_minimal() +
theme(legend.position = "bottom") +
geom_text(aes(label = pctlabel),
vjust=0.5,
hjust = 1.5,
position = position_dodge(1.0),
size=3.5,
colour = "white") +
scale_y_continuous(labels = percent) +
labs(x = " ",
y = "Frequency",
title = "Payment methods according to the product lines") +
coord_flip() #horizontal bar graph
Results:
Plotting the bar graph for the frequency of the days.
# Days frequency
#count the days
Cdays <- SS_df_cleaned %>% count(Days)
#plot
ggplot(Cdays, aes(x = reorder(Days, n), y = n)) +
geom_bar(stat = "identity",
fill = "indianred3",
color = "black") +
labs(x = "Days",
y = "Frequency",
title = "Frequency of Days")+
theme_minimal()
Plotting the bar graph for the frequency of the time using binning method.
# Time frequency
#count the time
Ctime <- SS_df_cleaned %>% select(Time)
#create tags for binning
tags <- c("10 am - 12 pm", "12 pm - 3 pm", "3 pm - 5 pm", "5 pm - 7 pm",
"7 pm - 9 pm")
#binning the time column
time_bins <- as_tibble(Ctime) %>%
mutate(Time_bin = case_when(
Time >= 10 & Time < 12 ~ tags[1],
Time >= 12 & Time < 15 ~ tags[2],
Time >= 15 & Time < 17 ~ tags[3],
Time >= 17 & Time < 19 ~ tags[4],
Time >= 19 & Time < 21 ~ tags[5],
))
#count the percentage of the time bins
plotdata <- time_bins %>% count(Time_bin) %>%
mutate(pct = n / sum(n),
pctlabel = paste0(round(pct*100), "%"))
#plot
ggplot(plotdata,
aes(x = reorder(Time_bin, -pct),
y = pct)) +
geom_bar(stat = "identity",
fill = "#6fd1d1",
colour = "black") +
geom_text(aes(label = pctlabel),
vjust = -0.25) +
scale_y_continuous(labels = percent) +
labs(x = "Time Range",
y = "Frequency",
title = "Busy Hour")+
theme_minimal()
Plotting the bar graph for the frequency of the days and the time.
# day n time
#add the time bins into the new df
new <- SS_df_cleaned %>% mutate(time_bins)
plotdata <- new %>% count(Days, Time_bin) %>%
mutate(pct = n / sum(n),
pctlabel = paste0(round(pct*100), "%"))
ggplot(plotdata,
aes(x = Time_bin,
y = pctlabel,
fill = Days)) +
geom_bar(stat = "identity",
position=position_dodge()) +
scale_fill_brewer(palette="Set2") +
theme_minimal() +
theme(legend.position = "bottom") +
labs(x = "Time Range",
y = "Frequency",
title = "Busy Hour")
Results:
Plotting a heatmap to visualize relationship between sales on weekdays vs weekend.
library(superheat)
new$Days <- recode(new$Days,
"Monday" = "Weekdays",
"Tuesday" = "Weekdays",
"Wednesday" = "Weekdays",
"Thursday" = "Weekdays",
"Friday" = "Weekdays",
"Saturday" = "Weekends",
"Sunday" = "Weekends")
#sales relationship between weekdays and weekends
sales <- new %>% select(Total)
#create tags for binning
tags <- c("$0-$200", "$200-$400", "$400-$600","$600-$800", ">$800")
#binning the sales column
Total_bins <- as_tibble(sales) %>%
mutate(Total_bins = case_when(
Total >= 10 & Total < 200 ~ tags[1],
Total >= 200 & Total < 400 ~ tags[2],
Total >= 400 & Total < 600 ~ tags[3],
Total >= 600 & Total < 800 ~ tags[4],
Total >= 800 ~ tags[5]
))
# temporary df
temp <- new %>%
mutate(Total_bins) %>%
count(Days, Total_bins) %>%
filter(Days == "Weekends") %>%
mutate(pct = n / sum(n),
pctlabel = paste0(round(pct*100), "%"))
#plot
new %>%
mutate(Total_bins) %>%
count(Days, Total_bins) %>%
filter(Days == "Weekdays") %>%
mutate(pct = n / sum(n),
pctlabel = paste0(round(pct*100), "%")) %>%
add_row(temp) %>%
ggplot(mapping = aes(x = Days, y = Total_bins)) +
geom_tile(mapping = aes(fill = pct*100)) +
geom_text(aes(label=pctlabel), colour = "#8c9191") +
scale_fill_distiller(palette = "YlGnBu", direction = 1) +
guides(fill = guide_legend(title = "Pecent (%)")) +
theme_minimal() +
labs(x = " ",
y = " ",
title = "Customer's spendature during Weekdays and Weekends")
Results:
Based on the results, we can conclude that:
EDA does not determine models on the data, but it helps business owners to explore possible data analysis models that best suit the data based on the business problems and goals.
In conclusion, we have successfully meet our objective by answering the questions posed in the beginning for different types of problems.