1. Introduction

Performing “Exploratory Data Analysis” on dataset “Superstore”; a task provided by The Sparks Foundation for Graduate Rotational Internship Program.

2. Task Summary

Using the dataset provided, try to find out the weak areas where the company is not performing well so that the business manager can focus on those areas to make more profit in the future by drawing best recommendations from the analysis.

For this analysis, we will answer the following questions:-

3. Process

3.1. Installing and Loading Packages

#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("here")
#install.packages("reader")
#install.packages("skimr")
#install.packages("plotrix")
#install.packages("anytime")
#tinytex::install_tinytex()
#install.packages("tidyr")
#install.packages("usmap")
#install.packages("gridExtra")
library("tidyverse")
library("dplyr")
library("lubridate")
library("reader")
library("here")
library("ggplot2")
library("plotrix")
library("magrittr")
library("anytime")
library("tidyr")
library("usmap")
library("gridExtra")

3.2. Importing dataset

SampleSuperstore <- read_csv("C:/Users/selam/Desktop/Case Study/The Spark Foundation Tasks/SampleSuperstore (1).csv")
head(SampleSuperstore)

3.3. Cleaning the Data

sum(duplicated(SampleSuperstore))
## [1] 17
SampleSuperstore <- SampleSuperstore%>% 
 distinct() %>% 
  drop_na()
names(SampleSuperstore) <- tolower(names(SampleSuperstore))

4. Exploratory Analysis

4.1 Aggregation

We will begin the analysis process by summarizing our data which includes calculating average and sum of sales, quantity, profit and grouping by region and categories.

ave_sales <- SampleSuperstore %>% 
  group_by(region, state) %>% 
  summarise(mean_sales = round(mean(sales), digits = 2), mean_quan = round(mean(quantity), digits = 0) , mean_profit =round(mean(profit), digits = 2), mean_discount = round(mean(discount), digits = 2))
glimpse(ave_sales)
sum_of_sales <- SampleSuperstore %>% 
  group_by(region, category) %>% 
  summarise(sum_of_quan = sum(quantity), sum_of_sales = sum(sales), sum_of_profit = sum(profit)) 
head(sum_of_sales)

First we plot a graph to analyze the profit by region and category

ggplot(data = sum_of_sales, mapping =aes(x = region, y = sum_of_profit, fill = region))+
  geom_bar(stat = "identity")+
  facet_wrap(~category)+
  theme_classic()+
  labs(
    x = "Region",
    y = "Profit",
    title = paste(
      "Profit by Region and Category"
    )
  )

plot_usmap(data = ave_sales,
           regions = "states", 
           values = "mean_profit")+
  labs(title = "Average Profit Distribution Across the State")+
  theme(legend.position = "right")

Observation

→ In all categories West region is performing good.

→ In furniture category, we can see a loss which is in Central region.

→ Among the three Categories Office Supplies is the top profitable followed by technology.

→ Furniture is the least profitable category in all regions.

Now let see if “Quantity” and “Profit” have a positive correlation;

ggplot(data = SampleSuperstore, aes(x = quantity, y = profit))+
  geom_point(aes(color = category))+
  labs(
    title = paste(
      "Quantity and Profit by Category"
    )
  )+
  facet_wrap(~region)

As we can see from the above plot there is no positive correlation between the quantity sold and the profit earned; so that we can safely say that quantity sold is not the factor for being profitable in all categories and regions.

4.2 Percentage

Percentage by Category

perc_by_category <- SampleSuperstore %>% 
  group_by(category) %>%
  summarise(total = n()) %>% 
  mutate(totals =sum(total)) %>% 
  group_by(category) %>% 
  mutate(percent = scales::percent(total/totals))
perc_by_category$category <- factor(perc_by_category$category , labels = c("Furniture", "Office Supplies", "Technology"))
head(perc_by_category)
## # A tibble: 3 × 4
## # Groups:   category [3]
##   category        total totals percent
##   <fct>           <int>  <int> <chr>  
## 1 Furniture        2118   9977 21%    
## 2 Office Supplies  6012   9977 60%    
## 3 Technology       1847   9977 19%

Percentage by Region

perc_by_region <- SampleSuperstore %>% 
  group_by(region) %>%
  summarise(total = n()) %>% 
  mutate(totals =sum(total)) %>% 
  group_by(region) %>% 
  mutate(percent = scales::percent(total/totals))
perc_by_region$region <- factor(perc_by_region$region, labels = c("Central", "East", "South", "West"))
head(perc_by_region)
## # A tibble: 4 × 4
## # Groups:   region [4]
##   region  total totals percent
##   <fct>   <int>  <int> <chr>  
## 1 Central  2319   9977 23%    
## 2 East     2845   9977 29%    
## 3 South    1620   9977 16%    
## 4 West     3193   9977 32%

→ For a better visualization, let’s plot a pie chart

par(mfrow = c(1,2))
plot1 <- ggplot(data = perc_by_category, 
       aes(x="", y = percent, fill = category))+
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  geom_text(aes(y = percent, label = paste(percent, sep = "")), col = "black", position = position_stack(vjust = 0.5), size = 2.5)+
  labs(title="Percentage by Category")+
  theme_void()+
  scale_fill_brewer(palette = 1)
plot2 <- ggplot(data = perc_by_region, 
       aes(x = "", y = percent, fill = region))+
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  geom_text(aes(y = percent, label = paste(percent, sep = "")), col = "black", position = position_stack(vjust = 0.5), size = 2.5)+
  labs(title="Percentage by Region")+
  theme_void()+
  scale_fill_brewer(palette = 1)
grid.arrange(plot1, plot2, ncol = 2)

From the above pie chart we can see that: -

→ 60% of total sales is in the office supplies category,

→ 32% of the total sales is in the West region.

Now let see which State, subcategory and ship mode are more profitable;

sum_subcat_profit <- SampleSuperstore %>% 
  group_by(`sub-category`) %>% 
  summarise(sum_of_sales = sum(sales), sum_of_profit = sum(profit)) 
glimpse(sum_subcat_profit)
sum_state_profit <- SampleSuperstore %>% 
  group_by(state) %>% 
  summarise(sum_of_sales = sum(sales), sum_of_profit = sum(profit)) 
glimpse(sum_state_profit)
ship_mode <- SampleSuperstore %>% 
  group_by(`ship mode`, category) %>% 
  summarise(sum_of_sales = sum(sales), sum_of_profit = sum(profit)) 
head(ship_mode)

Plotting

par(mfrow = c(1,2))
plotA <- top_n(sum_subcat_profit, n = 5, sum_of_profit) %>%
ggplot(., mapping = aes(x = `sub-category`, y = sum_of_profit, fill = `sub-category`))+
  geom_bar(stat = "identity")+
  theme_void()+
  labs(title = paste(
      "Higher Proft by Subcategory"
    )
  )
plotB <- sum_subcat_profit%>% 
  arrange(sum_of_profit) %>% 
  slice(1:5) %>% 
ggplot(., aes(x = `sub-category`, y = sum_of_profit, fill = `sub-category`))+
  geom_bar(stat = "identity")+
  theme_void()+
  labs(title = paste(
      "Lower Proft by Subcategory"
    )
  )
grid.arrange(plotA, plotB, ncol = 2)

From the above graph we can see that:

→ “Copiers” are number one profitable subcategory followed by “Phones”,

→ Three subcategories; “Bookcases”, “Supplies” and “Tables” noticeably lower in profit and having a loss.

Profitable States

top_n(sum_state_profit, n = 5, sum_of_profit) %>% 
ggplot(., aes(x = state, y = sum_of_profit, fill = state))+
  geom_bar(stat = "identity")+
  theme_classic()+
  labs(
    x = "State",
    y = "Profit",
    title = paste(
      "Proft by State"
    )
  )

The top 5 States from the perspective of profitability are: -

  • California
  • Michigan
  • New York
  • Virginia
  • Washington

Shipping mode percentage and Profit

ship_mode_percent <- SampleSuperstore %>% 
  group_by(`ship mode`) %>% 
  summarise(total = n()) %>% 
  mutate(totals = sum(total)) %>% 
  group_by(`ship mode`) %>% 
  mutate(percent = scales::percent(total/totals))
head(ship_mode_percent)
## # A tibble: 4 × 4
## # Groups:   ship mode [4]
##   `ship mode`    total totals percent
##   <chr>          <int>  <int> <chr>  
## 1 First Class     1537   9977 15%    
## 2 Same Day         542   9977 5%     
## 3 Second Class    1943   9977 19%    
## 4 Standard Class  5955   9977 60%
par(mfrow = c(1,2))
hsize <- 1
ship_mode_percent %>% 
  mutate(x = hsize)
## # A tibble: 4 × 5
## # Groups:   ship mode [4]
##   `ship mode`    total totals percent     x
##   <chr>          <int>  <int> <chr>   <dbl>
## 1 First Class     1537   9977 15%         1
## 2 Same Day         542   9977 5%          1
## 3 Second Class    1943   9977 19%         1
## 4 Standard Class  5955   9977 60%         1
plot1 <- ggplot(ship_mode_percent, aes(x = hsize, y = total, fill = `ship mode`))+
  geom_col(color = "black")+
  geom_text(aes(label = percent),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title = "Shiping Mode Percentile")+
  coord_polar(theta = "y")+
  theme_void()+
  theme(plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
  scale_fill_brewer(palette = 4)+
  xlim(c(0.2, hsize + 0.5))

plot2 <- ggplot(data = ship_mode, mapping =aes(x = `ship mode`, y = sum_of_profit, fill = `ship mode`))+
  geom_bar(stat = "identity")+
  labs(title = "Profit by Ship Mode")+
  scale_fill_brewer(palette = 4)+
  theme(axis.text.x = element_text(angle = 45),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
  theme_void()
grid.arrange(plot1, plot2, ncol = 2)

As we can see from the above chart, Standard Class is the number one profitable shipping mode followed by Second Class shipping mode.

Shipping mode can be one of the factor for being profitable as we can see some trend from the above visualization.

Discount impact on average profit

ggplot(data = ave_sales, aes(x = mean_profit, y = mean_discount))+
  geom_point(aes(color = region))+
  labs(
    title = paste(
      "Quantity and Profit by Category"
    )
  )

It shows that the higher the discount the lower the profit; so that discount can also be the other factor for loss and not being profitable.

Conclusion and Recommendations

To conclude our analysis: -

Recommendation

1.For the states and subcategories that are already profitable, we need to focus more in promoting sales and distribution of the product.

  1. For the items that encounters loss, we need to remove discount and take in to consideration what kind of shipping mode can be more cost effective based on the region that the item shipped.

  2. The majority of categories that are more profitable, Standard Class ship mode used so that, it is more ideal to charge extra fee for the customer who likes to use the other shipping mode.

  3. Discount is more ideal in the event of customers ordering a bulk amount and when they agree to use the most cost effective shipping mode avaliable.