This is an exploratory data analysis (EDA) project that analyzes and visualizes the Superstore data set. The objective is to identify trends and patterns in current retail sales and to find which sectors of the market are running at a loss and which are making large profits.
LOADING LIBRARIES
library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
READING DATASET
store <- read.csv("SampleSuperstore.csv")
View(store)
str(store)
## 'data.frame': 9994 obs. of 13 variables:
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ City : chr "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
## $ State : chr "Kentucky" "Kentucky" "California" "Florida" ...
## $ Postal.Code : int 42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
## $ Region : chr "South" "South" "West" "South" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
summary(store)
## Ship.Mode Segment Country City
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## State Postal.Code Region Category
## Length:9994 Min. : 1040 Length:9994 Length:9994
## Class :character 1st Qu.:23223 Class :character Class :character
## Mode :character Median :56431 Mode :character Mode :character
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
## Sub.Category Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
DATA PREPARATION AND CLEANING
## [1] FALSE
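The code for this step is not shown above; only its result (`FALSE`) survives, and the plots that follow use a data frame named `store_1` that is never defined. A minimal sketch of what such a step might look like, assuming (this is a guess, not the original code) that the check was for missing values and that `store_1` is `store` with exact duplicate rows removed. The helper name `prepare_store` is hypothetical.

```r
# Hypothetical reconstruction: the original cleaning code is not shown.
prepare_store <- function(df) {
  # Report whether any cell in the data frame is missing
  message("Any missing values: ", anyNA(df))
  # Drop rows that are exact duplicates of an earlier row
  df[!duplicated(df), , drop = FALSE]
}

# Only runs when `store` has been loaded as in the previous section.
if (exists("store")) {
  store_1 <- prepare_store(store)
}
```

If the original step did something different (for example, dropping a column), the later plots would still run as long as `store_1` keeps the columns they reference.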
VISUALIZATIONS
1) Sales vs Quantity
The graph below shows that most sales come from orders shipped by Standard Class.
#Let's analyze patterns
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## 1) Sales vs Quantity
sale_v_quantity <- ggplot(data = store_1, aes(x = Quantity, y = Sales, fill = Ship.Mode)) +
geom_bar(stat = "identity")
ggplotly(sale_v_quantity)
sale_v_profit <- ggplot(data = store_1, aes(x = Sales, y = Profit, color = Ship.Mode)) +
geom_point()
ggplotly(sale_v_profit)
sale_v_discount <- ggplot() +
geom_point(data = store_1, aes(x = Discount, y = Sales, color = Ship.Mode))
ggplotly(sale_v_discount)
The graph suggests that discounts attract more sales, and that most discounted orders are shipped by Standard Class.
profit_v_discount <- ggplot() +
geom_bar(data = store_1, aes(x = Discount, y = Profit, fill = Ship.Mode), stat = "identity")
ggplotly(profit_v_discount)
Here we clearly see that the more discounts are offered and redeemed, the lower the profits. Products sold without a discount show a wide range of profits, but as the discount increases we see more and more losses and hardly any profit.
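The pattern read off the bar chart can also be checked numerically. The sketch below groups rows by discount level and reports total profit, mean profit, and the share of rows sold at a loss. The function name `profit_by_discount` and the output column names are illustrative choices, not part of the original analysis; it assumes `dplyr` is available (it is attached by `tidyverse`).

```r
library(dplyr)

# Summarise profit at each discount level.
# `df` needs numeric columns `Discount` and `Profit`, as in the Superstore data.
profit_by_discount <- function(df) {
  df %>%
    group_by(Discount) %>%
    summarise(
      orders       = n(),
      total_profit = sum(Profit),
      mean_profit  = mean(Profit),
      share_loss   = mean(Profit < 0),  # fraction of rows with negative profit
      .groups      = "drop"
    ) %>%
    arrange(Discount)
}

# Only runs when `store_1` exists in the session.
if (exists("store_1")) {
  print(profit_by_discount(store_1))
}
```

If the reading of the chart is right, `share_loss` should rise and `mean_profit` should fall as `Discount` increases.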
Now let's see whether the same pattern shows up across product sub-categories and categories.
# Plot for sub-category vs profit
category_v_profit <- ggplot() +
geom_bar(data = store_1, aes(x = Sub.Category, y = Profit, fill = Region), stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplotly(category_v_profit)
The largest losses appear in the Binders sub-category, mainly in the Central region; the Machines and Tables sub-categories also show losses.
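Stacked bars with negative values can be hard to read, so it may help to list the loss-making combinations directly. The sketch below sums profit per sub-category and region and keeps only the groups whose total is negative. The function name `loss_making_groups` is hypothetical, and it assumes `dplyr` is available.

```r
library(dplyr)

# Total profit per (Sub.Category, Region), keeping only groups with a net loss.
# `df` needs columns `Sub.Category`, `Region`, and `Profit`.
loss_making_groups <- function(df) {
  df %>%
    group_by(Sub.Category, Region) %>%
    summarise(total_profit = sum(Profit), .groups = "drop") %>%
    dplyr::filter(total_profit < 0) %>%  # explicit namespace, since filter() is masked by several packages
    arrange(total_profit)                # largest loss first
}

# Only runs when `store_1` exists in the session.
if (exists("store_1")) {
  print(loss_making_groups(store_1))
}
```

Note that this reports net totals, whereas the chart stacks individual rows, so a group with both large gains and large losses can look worse in the chart than in the table.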
category_v_sales <- ggplot() +
geom_bar(data = store_1, aes(x = Category, y = Sales, fill = Region), stat = "identity") +
theme(axis.text.x = element_text(angle = 90 , vjust = 0.5, hjust = 1))
ggplotly(category_v_sales)
Technology generates the most sales, followed by Furniture and Office Supplies. Most sales come from the West and East regions.
category_v_profit_1 <- ggplot() +
geom_bar(data = store_1, aes(x = Category, y = Profit, fill = Region), stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplotly(category_v_profit_1)
The Furniture category incurs more losses than the Technology and Office Supplies categories. Because sales in this category range from low to high, it still shows some profit.
# Sales vs profit, colored by category
sales_v_profit <- ggplot() +
geom_point(data = store_1, aes(x = Sales , y = Profit, color = Category))
ggplotly(sales_v_profit)
From the graphs above, we conclude that the sales-to-profit ratio is about the same in every category, however the data are grouped.
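One way to test this conclusion is to compute a profit margin per category, defined here as total profit divided by total sales. That definition is an assumption made for this sketch; the original text does not say how the ratio should be measured. The function name `margin_by_category` is hypothetical, and `dplyr` is assumed to be available.

```r
library(dplyr)

# Profit margin (total profit / total sales) for each category.
# `df` needs columns `Category`, `Sales`, and `Profit`.
margin_by_category <- function(df) {
  df %>%
    group_by(Category) %>%
    summarise(
      total_sales  = sum(Sales),
      total_profit = sum(Profit),
      .groups      = "drop"
    ) %>%
    mutate(margin = total_profit / total_sales)
}

# Only runs when `store_1` exists in the session.
if (exists("store_1")) {
  print(margin_by_category(store_1))
}
```

Similar `margin` values across categories would support the conclusion; clearly different values would suggest the scatter plot hides category-level differences.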
CONCLUSION
Same Day shipments could generate more sales and profit if they received more discounts. Discounts should be based on sales and should not exceed a certain range; otherwise, unnecessary discounts on low sales can lead to huge losses. The Binders and Machines sub-categories need more attention so that these weak areas can be strengthened. The Office Supplies and Furniture categories do not seem to do well in the Central region.