In this project, I will use data from retail.csv to do exploratory data analysis. This data contains sales records from a retail company which I named PT. Vards. PT.Vards is a global retail company selling Office Supplies, Furniture and Technology stuff. The buyers come from various segments such as Corporate, Home Office, and individual consumers. This company can also provide the best options for customer delivery according to customer needs.
First, we need to read data
vards <- read.csv("data_input/retail.csv")Next, we want to see top 10 data owned by PT.Vards
head(vards,10)Then, we want to dimension of data
dim(vards)## [1] 9994 15
this data contains 9994 rows and 15 columns
The first step in conducting data analysis is to ensure that the data to be used is clean. The initial stage we do is check data type for each column in the data
str(vards)## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Based on data type for each column above, there is an incorrect data type. Therefore, we have to change the data type
vards$Order.Date <- as.Date(vards$Order.Date, format = "%m/%d/%y")
vards$Ship.Date <- as.Date(vards$Ship.Date, format = "%m/%d/%y")
vards$Ship.Mode <- as.factor(vards$Ship.Mode)
vards$Segment <- as.factor(vards$Segment)
vards$Category <- as.factor(vards$Category)
vards$Sub.Category <- as.factor(vards$Sub.Category)and we recheck the data type again
str(vards)## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Now the data type is in accordance with each column
anyNA(vards)## [1] FALSE
colSums(is.na(vards))## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 0 0 0 0 0 0
## Segment Product.ID Category Sub.Category Product.Name Sales
## 0 0 0 0 0 0
## Quantity Discount Profit
## 0 0 0
Great! We don’t have missing values.
Next step, we want to retrieve the data needed to perform data analysis, i.e. data from columns 3-15, because we don’t need Row.ID and Order.ID columns
vards <- vards[,3:15]
head(vards)finally, we finish doing data cleansing
summary(vards)## Order.Date Ship.Date Ship.Mode
## Min. :2014-01-03 Min. :2014-01-07 First Class :1538
## 1st Qu.:2015-05-23 1st Qu.:2015-05-27 Same Day : 543
## Median :2016-06-26 Median :2016-06-29 Second Class :1945
## Mean :2016-04-30 Mean :2016-05-03 Standard Class:5968
## 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :2017-12-30 Max. :2018-01-05
##
## Customer.ID Segment Product.ID
## Length:9994 Consumer :5191 Length:9994
## Class :character Corporate :3020 Class :character
## Mode :character Home Office:1783 Mode :character
##
##
##
##
## Category Sub.Category Product.Name
## Furniture :2121 Binders :1523 Length:9994
## Office Supplies:6026 Paper :1370 Class :character
## Technology :1847 Furnishings: 957 Mode :character
## Phones : 889
## Storage : 846
## Art : 796
## (Other) :3613
## Sales Quantity Discount Profit
## Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
## 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
## Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
## Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
##
xtabs(Sales~Category,vards)## Category
## Furniture Office Supplies Technology
## 741999.8 719047.0 836154.0
** Answer : Technology category has the highest sales **
nrow(vards[vards$Ship.Mode %in% c("Standard Class", "First Class"),])## [1] 7506
** Answer : There are 7506 transactions whose delivery uses Standard Class and First Class **
xtabs(Profit~Category,vards)## Category
## Furniture Office Supplies Technology
## 18451.27 122490.80 145454.95
** Answer : Technology category has the highest profit **
nrow(vards[vards$Category == "Technology" & vards$Profit < 0,])## [1] 271
** Answer : There are 271 transactions from Technology category that resulted in losses(profit less than zero) **
table(vards$Ship.Mode, vards$Segment)##
## Consumer Corporate Home Office
## First Class 769 485 284
## Same Day 317 114 112
## Second Class 1020 609 316
## Standard Class 3085 1812 1071
** Answer : Type of delivery that is most widely used by the corporate segment is standard class **
profit <- aggregate(Profit~Segment+Category,vards,mean)
profit[order(profit$Profit,decreasing=T),]** Answer : Home Office segment and Technology category have the highest average profit at PT.Vards **
sum(vards$Quantity)## [1] 37873
** Answer : There are 37873 sold at PT. Vards **
sum(vards$Profit)## [1] 286397
** Answer : PT. Vards get a total profit of 286397 **
sum(vards$Sales)## [1] 2297201
** Answer : PT. Vards get a total sales of 2297201 **
xtabs(Quantity~Category,vards)## Category
## Furniture Office Supplies Technology
## 8028 22906 6939
** Answer : Office supplies category has the most quantity sold **
During 2014 - 2017, PT.Vards had total sales of 2297201 and total profit of 286397 and sold 37873 items. The most sold category is the office supplies category. The type of delivery that is mostly used by customers is standard class. The technology category has the highest number of sales and profits compared to other categories.
The home office segment and technology category had the highest average profit, which was 89.152458. There are 271 transactions from the technology category that have losses. Then 7506 transactions use standard class and first class delivery types.
From the conclusion above, the technology category has the highest total sales and total profit compared to other categories. PT. Vards must find out what causes the technology category to have the highest total sales and total profit compared to other categories and find ways to increase the total profit and total sales of other categories.
The same thing happens with the type of delivery. The standard class delivery type is the most widely used type of delivery. PT. Vards must find out what causes standard class shipping types to be more in demand than other types of shipping. If because the price is cheaper, then PT. Vards may perform price reductions on other types of shipping to increase the number of customers using other types of shipping.