On this occasion we will conduct simple exploratory data analysis of a sales data in a retail company.
The buyers are coming from different segment such as Corporate, Home Office until individual consumer. This company also could provide the best option for your delivery which suitable with your needs.
Input the data and store it in a variable named Retail.
Retail <- read.csv("data_input/retail.csv")And now we can do data inspection and cleansing.
Check if the saved data is correct.
head(Retail)Inspect the data.
str(Retail)## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
dim(Retail)## [1] 9994 15
names(Retail)## [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
## [6] "Customer.ID" "Segment" "Product.ID" "Category" "Sub.Category"
## [11] "Product.Name" "Sales" "Quantity" "Discount" "Profit"
From our inspection we can conclude :
* Retail data contain 9994 of rows and 15 of coloumns.
* Each of column name :
01. “Row.ID”,
02. “Order.ID”,
03. “Order.Date”,
04. “Ship.Date”,
05. “Ship.Mode”,
06. “Customer.ID”,
07. “Segment”,
08. “Product.ID”,
09. “Category”,
10. “Sub.Category”,
11. “Product.Name”,
12. “Sales”,
13. “Quantity”,
14. “Discount”, 15. “Profit”.
From the str() result, we find some of data type not in the corect type. we need to convert it into corect type (data coertion).
Retail$Row.ID <- as.character(Retail$Row.ID)
Retail$Order.Date<- as.Date(Retail$Order.Date, "%m/%d/%y")
Retail$Ship.Date <- as.Date(Retail$Ship.Date, "%m/%d/%y")
Retail$Ship.Mode <- as.factor(Retail$Ship.Mode)
Retail$Customer.ID<-as.character(Retail$Customer.ID)
Retail$Segment <- as.factor(Retail$Segment)
Retail$Product.ID<- as.character(Retail$Product.ID)
Retail$Category <- as.factor(Retail$Category )
Retail$Sub.Category<-as.factor(Retail$Sub.Category)
Retail$Product.Name<-as.factor(Retail$Product.Name)
str(Retail)## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : chr "1" "2" "3" "4" ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: Factor w/ 1850 levels "\"While you Were Out\" Message Book, One Form per Page",..: 387 833 1440 367 574 570 1137 1099 535 295 ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Each of column already changed into desired data type
Now, we have to check for the missing value in the data.
colSums(is.na(Retail))## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 0 0 0 0 0 0
## Segment Product.ID Category Sub.Category Product.Name Sales
## 0 0 0 0 0 0
## Quantity Discount Profit
## 0 0 0
anyNA(Retail)## [1] FALSE
From the result above, now we know that there are no missing value in the Retail data.
We will do subsetting to delete some column (1&2 because we dont need the informations). then save it into Retail_new variable
Retail_new <- Retail[,c(3:15)]
head(Retail_new)We will see a brief information from Retail_new data.
summary(Retail_new)## Order.Date Ship.Date Ship.Mode
## Min. :2014-01-03 Min. :2014-01-07 First Class :1538
## 1st Qu.:2015-05-23 1st Qu.:2015-05-27 Same Day : 543
## Median :2016-06-26 Median :2016-06-29 Second Class :1945
## Mean :2016-04-30 Mean :2016-05-03 Standard Class:5968
## 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :2017-12-30 Max. :2018-01-05
##
## Customer.ID Segment Product.ID
## Length:9994 Consumer :5191 Length:9994
## Class :character Corporate :3020 Class :character
## Mode :character Home Office:1783 Mode :character
##
##
##
##
## Category Sub.Category Product.Name
## Furniture :2121 Binders :1523 Staple envelope : 48
## Office Supplies:6026 Paper :1370 Easy-staple paper : 46
## Technology :1847 Furnishings: 957 Staples : 46
## Phones : 889 Avery Non-Stick Binders : 20
## Storage : 846 Staples in misc. colors : 19
## Art : 796 KI Adjustable-Height Table: 18
## (Other) :3613 (Other) :9797
## Sales Quantity Discount Profit
## Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
## 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
## Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
## Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
##
Summary :
1. First order occured in Jan 2014.
2. Standart Class was the most popular one for shipping option and Same
day delivery was the lowest one.
3. The most buyers coming from consumer segment and home office segment
was the lowest.
4. Office Supply category was the best seller compare to others.
5. In Sub category, other stuff was the most favourite one then followed
by Binders and paper. Art is the lowest.
6. The average sales at 229.858; with max sales value at 22638.480 and
minumum sales value at 0.444
7. Max quantity order was 14 within 4 years but average at 3.79.
8. Max amount of discount was 80%, but mean number of discount around
15%.
9. The ever had loss at 6599 and gained 8399 as their highest profit
with average profit at 28.657.
Talking about “sales” then surely we will talking about
“profit”.
So next we will do some aggregation using the Profit variable
as the main reference with several other variables that are very
related.
These variables include Category, Sub.category, &
Product.Name.
Therefore we will do subsetting so that the data that appears is only
data from these variables.
Retail_Profit <- Retail_new[ ,c("Category", "Sub.Category", "Product.Name", "Profit")]And now we can carry out the above-mentioned aggression process along with visualizing the data.
Retail_Profit_Category <- aggregate(Profit~Category,Retail_Profit,mean)ggplot(data = Retail_Profit_Category, mapping = aes(x=Profit, y=Category )) +
geom_col()Retail_Profit_Sub.Category <- aggregate(Profit~Sub.Category,Retail_Profit,mean)ggplot(data = Retail_Profit_Sub.Category, mapping = aes(x= Profit, y= Sub.Category )) +
geom_col()Retail_Profit_Product.Name <- aggregate(Profit~Product.Name,Retail_Profit,mean)
head(Retail_Profit_Product.Name[order(Retail_Profit_Product.Name$Profit, decreasing=T), ],5)tail(Retail_Profit_Product.Name[order(Retail_Profit_Product.Name$Profit, decreasing=T), ],5)Based on the exploration of the data above, we can perform the
following analysis:
1. The “technology” category is the category that
generates the highest profit. And the furniture category is the one with
the lowest profit among the three categories.
Suggestion : Prioritize adding product promotions to the
furniture category but still maintaining the ongoing promotions for the
other two categories.
2. In the sub-category for the type of “copier”
product, it gets profit far above other sub-categories. Meanwhile, other
sub-categories such as tables, supplies, & bookcases experienced
losses.
Suggestion : Prioritize adding more product promotions to the
other sub-category other than “copier” but still maintaining the ongoing
promotions for the other two categories. And provide massive discounts
for sales of sub-categories of products experiencing losses.
3. “Canon imageCLASS 2200 Advanced Copier” is a product
that produces the highest profit among other products. However, there
are some products as well that cause loss.
Suggestion : provide massive discounts for sales of products
experiencing losses to minimize losses.