Library

library(parsedate)
library(scales)

Data Input

The data is from retail sales. Lets explore the data to see what kind of category and segment are popular.

retail <- read.csv("retail.csv")

Data Inspection

Lets see the data we have

head(retail)

Lets see the data type

str(retail)

## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

It seems that we have to change some of the data types

retail$Category <- factor(retail$Category) 
retail$Sub.Category <- factor(retail$Sub.Category) 
retail$Segment <- factor(retail$Segment)
retail$Ship.Mode <- factor(retail$Ship.Mode)
retail$Ship.Date <- parse_date(retail$Ship.Date)
retail$Order.Date <- parse_date(retail$Order.Date)

str(retail)

## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : POSIXct, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : POSIXct, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Okay, all the data already change into the right type

colSums(is.na(retail))

##       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
##            0            0            0            0            0            0 
##      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
##            0            0            0            0            0            0 
##     Quantity     Discount       Profit 
##            0            0            0

Because we have no NA in all columns, we can just go ahead and do exploration to answer business cases

Business Case 1

I want to know average of sales from every segment

segment_sales <- aggregate(Sales ~ Segment, data=retail, FUN = mean) 
segment_sales$presentase <- segment_sales$Sales / sum(segment_sales$Sales)
segment_sales$presentase <- percent(segment_sales$presentase)
segment_sales$Sales <- dollar(segment_sales$Sales)
segment_sales

Based on the data, the highest average sales was from Home Office segment.

Business Case 2

After I know, the average sales of each segment, I want to know the most profitable category

category_profit <- aggregate(Profit ~ Category + Segment, data = retail, FUN = sum)
category_profit <- category_profit[order(category_profit$Profit, decreasing = T), ]
category_profit$Profit <- dollar(category_profit$Profit)
category_profit

Based on the data, technology category from consumer segment has the highest profit. Also it seems that furniture category from all segment have the lowest profit.

Business Case 3

Lets see, what item has the highest profit and least in technology category and consumer segment.

item_sales <- retail[retail$Segment == "Consumer" & retail$Category == "Technology", ]
item_sales <- aggregate(Profit ~ Product.Name, data = retail, FUN = sum)
item_sales[order(item_sales$Profit, decreasing = T), ]

item_sales <- item_sales[item_sales$Profit > 0, ]
item_sales[order(item_sales$Profit, decreasing = F), ]

Based on the data, Canon imageCLASS 2200 Advanced Copier has the highest profit and HON 5400 Series Task Chairs for Big and Tall has the lowest profit.

Exploratory Data Analysis - Retail Dataset

Patricio Kresnatama

January 4, 2021