Performing “Exploratory Data Analysis” on dataset “Superstore”; a task provided by The Sparks Foundation for Graduate Rotational Internship Program.
Using the dataset provided, try to find out the weak areas where the company is not performing well so that the business manager can focus on those areas to make more profit in the future by drawing best recommendations from the analysis.
For this analysis, we will answer the following questions:-
Which states are selling less quantities and are less profitable and also which states are doing good?
What insight we can find by analyzing the “Quantity” and “Profit” by “Region” and “Category”?
Does the “Ship Mode” and “Discount” have an impact on the profit and quantity sold.
After analyzing the data to answer the above question, what recommendations can we draw to implement in making less profitable categories and regions do better?
#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("here")
#install.packages("reader")
#install.packages("skimr")
#install.packages("plotrix")
#install.packages("anytime")
#tinytex::install_tinytex()
#install.packages("tidyr")
#install.packages("usmap")
#install.packages("gridExtra")
library("tidyverse")
library("dplyr")
library("lubridate")
library("reader")
library("here")
library("ggplot2")
library("plotrix")
library("magrittr")
library("anytime")
library("tidyr")
library("usmap")
library("gridExtra")
SampleSuperstore <- read_csv("C:/Users/selam/Desktop/Case Study/The Spark Foundation Tasks/SampleSuperstore (1).csv")
head(SampleSuperstore)
sum(duplicated(SampleSuperstore))
## [1] 17
SampleSuperstore <- SampleSuperstore%>%
distinct() %>%
drop_na()
names(SampleSuperstore) <- tolower(names(SampleSuperstore))
We will begin the analysis process by summarizing our data which includes calculating average and sum of sales, quantity, profit and grouping by region and categories.
ave_sales <- SampleSuperstore %>%
group_by(region, state) %>%
summarise(mean_sales = round(mean(sales), digits = 2), mean_quan = round(mean(quantity), digits = 0) , mean_profit =round(mean(profit), digits = 2), mean_discount = round(mean(discount), digits = 2))
glimpse(ave_sales)
sum_of_sales <- SampleSuperstore %>%
group_by(region, category) %>%
summarise(sum_of_quan = sum(quantity), sum_of_sales = sum(sales), sum_of_profit = sum(profit))
head(sum_of_sales)
First we plot a graph to analyze the profit by region and category
ggplot(data = sum_of_sales, mapping =aes(x = region, y = sum_of_profit, fill = region))+
geom_bar(stat = "identity")+
facet_wrap(~category)+
theme_classic()+
labs(
x = "Region",
y = "Profit",
title = paste(
"Profit by Region and Category"
)
)
plot_usmap(data = ave_sales,
regions = "states",
values = "mean_profit")+
labs(title = "Average Profit Distribution Across the State")+
theme(legend.position = "right")
Observation
→ In all categories West region is performing good.
→ In furniture category, we can see a loss which is in Central region.
→ Among the three Categories Office Supplies is the top profitable followed by technology.
→ Furniture is the least profitable category in all regions.
Now let see if “Quantity” and “Profit” have a positive correlation;
ggplot(data = SampleSuperstore, aes(x = quantity, y = profit))+
geom_point(aes(color = category))+
labs(
title = paste(
"Quantity and Profit by Category"
)
)+
facet_wrap(~region)
As we can see from the above plot there is no positive correlation between the quantity sold and the profit earned; so that we can safely say that quantity sold is not the factor for being profitable in all categories and regions.
perc_by_category <- SampleSuperstore %>%
group_by(category) %>%
summarise(total = n()) %>%
mutate(totals =sum(total)) %>%
group_by(category) %>%
mutate(percent = scales::percent(total/totals))
perc_by_category$category <- factor(perc_by_category$category , labels = c("Furniture", "Office Supplies", "Technology"))
head(perc_by_category)
## # A tibble: 3 × 4
## # Groups: category [3]
## category total totals percent
## <fct> <int> <int> <chr>
## 1 Furniture 2118 9977 21%
## 2 Office Supplies 6012 9977 60%
## 3 Technology 1847 9977 19%
perc_by_region <- SampleSuperstore %>%
group_by(region) %>%
summarise(total = n()) %>%
mutate(totals =sum(total)) %>%
group_by(region) %>%
mutate(percent = scales::percent(total/totals))
perc_by_region$region <- factor(perc_by_region$region, labels = c("Central", "East", "South", "West"))
head(perc_by_region)
## # A tibble: 4 × 4
## # Groups: region [4]
## region total totals percent
## <fct> <int> <int> <chr>
## 1 Central 2319 9977 23%
## 2 East 2845 9977 29%
## 3 South 1620 9977 16%
## 4 West 3193 9977 32%
→ For a better visualization, let’s plot a pie chart
par(mfrow = c(1,2))
plot1 <- ggplot(data = perc_by_category,
aes(x="", y = percent, fill = category))+
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0)+
geom_text(aes(y = percent, label = paste(percent, sep = "")), col = "black", position = position_stack(vjust = 0.5), size = 2.5)+
labs(title="Percentage by Category")+
theme_void()+
scale_fill_brewer(palette = 1)
plot2 <- ggplot(data = perc_by_region,
aes(x = "", y = percent, fill = region))+
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0)+
geom_text(aes(y = percent, label = paste(percent, sep = "")), col = "black", position = position_stack(vjust = 0.5), size = 2.5)+
labs(title="Percentage by Region")+
theme_void()+
scale_fill_brewer(palette = 1)
grid.arrange(plot1, plot2, ncol = 2)
From the above pie chart we can see that: -
→ 60% of total sales is in the office supplies category,
→ 32% of the total sales is in the West region.
Now let see which State, subcategory and ship mode are more profitable;
sum_subcat_profit <- SampleSuperstore %>%
group_by(`sub-category`) %>%
summarise(sum_of_sales = sum(sales), sum_of_profit = sum(profit))
glimpse(sum_subcat_profit)
sum_state_profit <- SampleSuperstore %>%
group_by(state) %>%
summarise(sum_of_sales = sum(sales), sum_of_profit = sum(profit))
glimpse(sum_state_profit)
ship_mode <- SampleSuperstore %>%
group_by(`ship mode`, category) %>%
summarise(sum_of_sales = sum(sales), sum_of_profit = sum(profit))
head(ship_mode)
Plotting
par(mfrow = c(1,2))
plotA <- top_n(sum_subcat_profit, n = 5, sum_of_profit) %>%
ggplot(., mapping = aes(x = `sub-category`, y = sum_of_profit, fill = `sub-category`))+
geom_bar(stat = "identity")+
theme_void()+
labs(title = paste(
"Higher Proft by Subcategory"
)
)
plotB <- sum_subcat_profit%>%
arrange(sum_of_profit) %>%
slice(1:5) %>%
ggplot(., aes(x = `sub-category`, y = sum_of_profit, fill = `sub-category`))+
geom_bar(stat = "identity")+
theme_void()+
labs(title = paste(
"Lower Proft by Subcategory"
)
)
grid.arrange(plotA, plotB, ncol = 2)
From the above graph we can see that:
→ “Copiers” are number one profitable subcategory followed by “Phones”,
→ Three subcategories; “Bookcases”, “Supplies” and “Tables” noticeably lower in profit and having a loss.
Profitable States
top_n(sum_state_profit, n = 5, sum_of_profit) %>%
ggplot(., aes(x = state, y = sum_of_profit, fill = state))+
geom_bar(stat = "identity")+
theme_classic()+
labs(
x = "State",
y = "Profit",
title = paste(
"Proft by State"
)
)
The top 5 States from the perspective of profitability are: -
Shipping mode percentage and Profit
ship_mode_percent <- SampleSuperstore %>%
group_by(`ship mode`) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(`ship mode`) %>%
mutate(percent = scales::percent(total/totals))
head(ship_mode_percent)
## # A tibble: 4 × 4
## # Groups: ship mode [4]
## `ship mode` total totals percent
## <chr> <int> <int> <chr>
## 1 First Class 1537 9977 15%
## 2 Same Day 542 9977 5%
## 3 Second Class 1943 9977 19%
## 4 Standard Class 5955 9977 60%
par(mfrow = c(1,2))
hsize <- 1
ship_mode_percent %>%
mutate(x = hsize)
## # A tibble: 4 × 5
## # Groups: ship mode [4]
## `ship mode` total totals percent x
## <chr> <int> <int> <chr> <dbl>
## 1 First Class 1537 9977 15% 1
## 2 Same Day 542 9977 5% 1
## 3 Second Class 1943 9977 19% 1
## 4 Standard Class 5955 9977 60% 1
plot1 <- ggplot(ship_mode_percent, aes(x = hsize, y = total, fill = `ship mode`))+
geom_col(color = "black")+
geom_text(aes(label = percent),
position = position_stack(vjust = 0.5), size = 3)+
labs(title = "Shiping Mode Percentile")+
coord_polar(theta = "y")+
theme_void()+
theme(plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
scale_fill_brewer(palette = 4)+
xlim(c(0.2, hsize + 0.5))
plot2 <- ggplot(data = ship_mode, mapping =aes(x = `ship mode`, y = sum_of_profit, fill = `ship mode`))+
geom_bar(stat = "identity")+
labs(title = "Profit by Ship Mode")+
scale_fill_brewer(palette = 4)+
theme(axis.text.x = element_text(angle = 45),
plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
theme_void()
grid.arrange(plot1, plot2, ncol = 2)
As we can see from the above chart, Standard Class is the number one profitable shipping mode followed by Second Class shipping mode.
Shipping mode can be one of the factor for being profitable as we can see some trend from the above visualization.
Discount impact on average profit
ggplot(data = ave_sales, aes(x = mean_profit, y = mean_discount))+
geom_point(aes(color = region))+
labs(
title = paste(
"Quantity and Profit by Category"
)
)
It shows that the higher the discount the lower the profit; so that discount can also be the other factor for loss and not being profitable.
To conclude our analysis: -
West region shows a good performance in sales and profit which is 32% on the other hand Central America is the list performing state.
Office Supplies is the top profitable category which is 60% followed by technology.
Furniture is the least profitable category in all region.
Quantity and profit doesn’t have a positive correlation.
Copier is more profitable subcategory followed by Phone
There is a loss that we can see in Book Case, Supplies and Table subcategory.
Top five States that are performing good are: California, Michigan, New York, Virginia and Washington.
Recommendation
1.For the states and subcategories that are already profitable, we need to focus more in promoting sales and distribution of the product.
For the items that encounters loss, we need to remove discount and take in to consideration what kind of shipping mode can be more cost effective based on the region that the item shipped.
The majority of categories that are more profitable, Standard Class ship mode used so that, it is more ideal to charge extra fee for the customer who likes to use the other shipping mode.
Discount is more ideal in the event of customers ordering a bulk amount and when they agree to use the most cost effective shipping mode avaliable.