library(parsedate)
library(scales)The data is from retail sales. Lets explore the data to see what kind of category and segment are popular.
retail <- read.csv("retail.csv")Lets see the data we have
head(retail)Lets see the data type
str(retail)## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
It seems that we have to change some of the data types
retail$Category <- factor(retail$Category)
retail$Sub.Category <- factor(retail$Sub.Category)
retail$Segment <- factor(retail$Segment)
retail$Ship.Mode <- factor(retail$Ship.Mode)
retail$Ship.Date <- parse_date(retail$Ship.Date)
retail$Order.Date <- parse_date(retail$Order.Date)str(retail)## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : POSIXct, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : POSIXct, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Okay, all the data already change into the right type
colSums(is.na(retail))## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 0 0 0 0 0 0
## Segment Product.ID Category Sub.Category Product.Name Sales
## 0 0 0 0 0 0
## Quantity Discount Profit
## 0 0 0
Because we have no NA in all columns, we can just go ahead and do exploration to answer business cases
I want to know average of sales from every segment
segment_sales <- aggregate(Sales ~ Segment, data=retail, FUN = mean)
segment_sales$presentase <- segment_sales$Sales / sum(segment_sales$Sales)
segment_sales$presentase <- percent(segment_sales$presentase)
segment_sales$Sales <- dollar(segment_sales$Sales)
segment_salesBased on the data, the highest average sales was from Home Office segment.
After I know, the average sales of each segment, I want to know the most profitable category
category_profit <- aggregate(Profit ~ Category + Segment, data = retail, FUN = sum)
category_profit <- category_profit[order(category_profit$Profit, decreasing = T), ]
category_profit$Profit <- dollar(category_profit$Profit)
category_profitBased on the data, technology category from consumer segment has the highest profit. Also it seems that furniture category from all segment have the lowest profit.
Lets see, what item has the highest profit and least in technology category and consumer segment.
item_sales <- retail[retail$Segment == "Consumer" & retail$Category == "Technology", ]
item_sales <- aggregate(Profit ~ Product.Name, data = retail, FUN = sum)
item_sales[order(item_sales$Profit, decreasing = T), ]item_sales <- item_sales[item_sales$Profit > 0, ]
item_sales[order(item_sales$Profit, decreasing = F), ]Based on the data, Canon imageCLASS 2200 Advanced Copier has the highest profit and HON 5400 Series Task Chairs for Big and Tall has the lowest profit.