LBB1: Retail Exploratory Data Analysis

1. Case Description

Brief Introduction

Hi My Name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my first RPubs Document as part of Learn By Building Assignment at Algoritma Data Science School.

Data Explanation

The data stored a record of a transactions of a retail store specifying several variables. Please take a look at the following glossary for your referrence:
- Order.ID: Id of order.
- Order.Date: Date of order.
- Ship.Date: Date of shipment.
- Ship.Mode: Type of shipment.
- Customer.ID: Id of customer.
- Segment : Customer’s segment.
- Product.ID: Id of product.
- Category : Product category with 3 levels (“Furniture”, “Office Supplies”, “Technology”)
- Sub.Category: more specific product category
- Product.Name: Name of product that was sold.
- Sales: How much earning from each sale.
- Quantity: Quantity of item sold.
- Discount: How much discount was given for each sale.
- Profit: How much can a company earn from each sale.

2. Load Dataset

First of all, we need to read retail.csv in our folder using read.csv() function, save it into an object named retail_df, and display first 6 rows of the data.

retail_df <- read.csv("retail.csv")
head(retail_df)

2.1. Inspect Data

Display last 6 rows of retail dataset using ‘tail’ function

tail(retail_df)

Retail dataset has 14 columns and 9994 rows. We can quickly check the dimension of our data by using ‘dim’ function.

dim(retail_df)

#> [1] 9994   14

Even though we already know column names inside retail dataset, we still can validate by using ‘names’ function.

names(retail_df)

#>  [1] "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"    "Customer.ID" 
#>  [6] "Segment"      "Product.ID"   "Category"     "Sub.Category" "Product.Name"
#> [11] "Sales"        "Quantity"     "Discount"     "Profit"

The output above and column names list in the Case Description section are the same (consists of 14 columns).

2.1. Data Cleaning & Coertions

After we load the data, we need to check data type for each column inside the dataset to get general picture of our data.

str(retail_df)

#> 'data.frame':    9994 obs. of  14 variables:
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : chr  "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
#>  $ Ship.Date   : chr  "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
#>  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#>  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Retail dataset has 10 ‘chr’, 3 ‘num’, and 1 ‘int’ data types, but we know that some columns have wrong data types. That’s why we need to do adjustment on ‘Order.Date’, ‘Ship.Date’, ‘Ship.Mode’, ‘Segment’, ‘Category’, ‘Sub.Category’ to get the correct data types.

library(lubridate)
retail_df$Order.Date<- mdy(retail_df$Order.Date)
retail_df$Ship.Date <- mdy(retail_df$Ship.Date)
retail_df$Ship.Mode <- as.factor(retail_df$Ship.Mode)
retail_df$Segment <- as.factor(retail_df$Segment)
retail_df$Category <- as.factor(retail_df$Category)
retail_df$Sub.Category <- as.factor(retail_df$Sub.Category)
str(retail_df)

#> 'data.frame':    9994 obs. of  14 variables:
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
#>  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
#>  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#>  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Looks like all of column names in retail dataset already in correct data types.
Next, we will check whether our dataset has missing values or not.

colSums(is.na(retail_df))

#>     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID      Segment 
#>            0            0            0            0            0            0 
#>   Product.ID     Category Sub.Category Product.Name        Sales     Quantity 
#>            0            0            0            0            0            0 
#>     Discount       Profit 
#>            0            0

anyNA(retail_df)

#> [1] FALSE

Looks like the dataset doesn’t have any missing value since all number below column names is zero and the output of ‘anyNA’ function is FALSE. That means our dataset is ready to be processed and analyzed.

3. Exploratory Data Analysis

We start by check summary of our data by using ‘summary’ function.

summary(retail_df)

#>    Order.ID           Order.Date           Ship.Date         
#>  Length:9994        Min.   :2014-01-03   Min.   :2014-01-07  
#>  Class :character   1st Qu.:2015-05-23   1st Qu.:2015-05-27  
#>  Mode  :character   Median :2016-06-26   Median :2016-06-29  
#>                     Mean   :2016-04-30   Mean   :2016-05-03  
#>                     3rd Qu.:2017-05-14   3rd Qu.:2017-05-18  
#>                     Max.   :2017-12-30   Max.   :2018-01-05  
#>                                                              
#>           Ship.Mode    Customer.ID               Segment      Product.ID       
#>  First Class   :1538   Length:9994        Consumer   :5191   Length:9994       
#>  Same Day      : 543   Class :character   Corporate  :3020   Class :character  
#>  Second Class  :1945   Mode  :character   Home Office:1783   Mode  :character  
#>  Standard Class:5968                                                           
#>                                                                                
#>                                                                                
#>                                                                                
#>             Category         Sub.Category  Product.Name      
#>  Furniture      :2121   Binders    :1523   Length:9994       
#>  Office Supplies:6026   Paper      :1370   Class :character  
#>  Technology     :1847   Furnishings: 957   Mode  :character  
#>                         Phones     : 889                     
#>                         Storage    : 846                     
#>                         Art        : 796                     
#>                         (Other)    :3613                     
#>      Sales              Quantity        Discount          Profit         
#>  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
#>  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
#>  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
#>  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
#>  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
#>  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
#>

Insight from Data Summary:
1. First order date was occured in Jan 2014, In 4 year, the company can booked 9994 transactions.
2. The most favorable Ship mode by customers is Standard Class.
3. More than 50% customers coming from Consumer Segment while Office supplies category dominates the market.
4. Mean value of sales is 229.858 and there is huge gap from 3rd Quartile sales value with Max sales value (~22k difference). That means there is possibility of outlier value in Sales column.
5. The maximum discount offered by the seller is 80%.
6. The maximum quantity ordered by the buyer is 14 units.
7. There is possibility of outlier in Profit columns since there is huge gap between mean profit value compare to minimum and maximum value.

3.1 EDA Segment vs Profit

Since we have hypothesis of outlier appearance from Sales and Profit, we will conduct investigation where the outlier coming from. We start by analyzing relationship between segment and profit.
Statistic for Consumer Segment:

summary(retail_df[retail_df$Segment=='Consumer',][,c('Segment','Profit')])

#>         Segment         Profit         
#>  Consumer   :5191   Min.   :-6599.978  
#>  Corporate  :   0   1st Qu.:    1.584  
#>  Home Office:   0   Median :    8.375  
#>                     Mean   :   25.837  
#>                     3rd Qu.:   28.607  
#>                     Max.   : 6719.981

boxplot(retail_df[retail_df$Segment=='Consumer',][,c('Profit')])

hist(retail_df[retail_df$Segment=='Consumer',][,c('Profit')],col = 'green')

Statistic for Corporate Segment:

summary(retail_df[retail_df$Segment=='Corporate',][,c('Segment','Profit')])

#>         Segment         Profit         
#>  Consumer   :   0   Min.   :-3839.990  
#>  Corporate  :3020   1st Qu.:    1.928  
#>  Home Office:   0   Median :    8.694  
#>                     Mean   :   30.457  
#>                     3rd Qu.:   29.843  
#>                     Max.   : 8399.976

boxplot(retail_df[retail_df$Segment=='Corporate',][,c('Profit')])

hist(retail_df[retail_df$Segment=='Corporate',][,c('Profit')],col = 'green')

Statistic for Home Office Segment:

summary(retail_df[retail_df$Segment=='Home Office',][,c('Segment','Profit')])

#>         Segment         Profit         
#>  Consumer   :   0   Min.   :-3399.980  
#>  Corporate  :   0   1st Qu.:    1.984  
#>  Home Office:1783   Median :    9.102  
#>                     Mean   :   33.819  
#>                     3rd Qu.:   31.560  
#>                     Max.   : 3919.989

boxplot(retail_df[retail_df$Segment=='Home Office',][,c('Profit')])

hist(retail_df[retail_df$Segment=='Home Office',][,c('Profit')],col = 'green')

Data Population from positive profit row for each Segment:

aggregate(Profit~Segment,retail_df[(retail_df$Profit>=0),],length)

Data Population from negative profit (loss) row for each Segment:

aggregate(Profit~Segment,retail_df[(retail_df$Profit<0),],length)

Percentage of negative profit from sales of Segment:

aggregate(Profit~Segment,retail_df[(retail_df$Profit<0),],length)$Profit/(aggregate(Profit~Segment,retail_df[(retail_df$Profit<0),],length)$Profit+aggregate(Profit~Segment,retail_df[(retail_df$Profit>=0),],length)$Profit)

#> [1] 0.193219 0.184106 0.174986

Standard Deviation (SD) profit for each segment category:

aggregate(Profit~Segment,retail_df,sd)

Variance profit for each segment category:

aggregate(Profit~Segment,retail_df,var)

Insights:
1. All of segment have similar profit distribution profile with many outlier data in both upper and lower section.
2. Most of profit data population located in between -1000 & 1000 for every segment of the company.
3. Consumer, Corporate, and Home Office Segment also have similar percentage of loss.
4. The data has large data spread based on variance value of profit for each segment category and the spread still acceptable because SD value for each segment category ~200 which means still considered as low.

3.2 EDA Segment vs Sales

Statistic for Consumer Segment:

summary(retail_df[retail_df$Segment=='Consumer',][,c('Segment','Sales')])

#>         Segment         Sales          
#>  Consumer   :5191   Min.   :    0.444  
#>  Corporate  :   0   1st Qu.:   17.264  
#>  Home Office:   0   Median :   53.720  
#>                     Mean   :  223.734  
#>                     3rd Qu.:  206.106  
#>                     Max.   :13999.960

boxplot(retail_df[retail_df$Segment=='Consumer',][,c('Sales')])

hist(retail_df[retail_df$Segment=='Consumer',][,c('Sales')],col = 'green')

Statistic for Corporate Segment:

summary(retail_df[retail_df$Segment=='Corporate',][,c('Segment','Sales')])

#>         Segment         Sales          
#>  Consumer   :   0   Min.   :    0.556  
#>  Corporate  :3020   1st Qu.:   17.469  
#>  Home Office:   0   Median :   56.540  
#>                     Mean   :  233.823  
#>                     3rd Qu.:  212.895  
#>                     Max.   :17499.950

boxplot(retail_df[retail_df$Segment=='Corporate',][,c('Sales')])

hist(retail_df[retail_df$Segment=='Corporate',][,c('Sales')],col = 'green')

Statistic for Home Office Segment:

summary(retail_df[retail_df$Segment=='Home Office',][,c('Segment','Sales')])

#>         Segment         Sales         
#>  Consumer   :   0   Min.   :    0.99  
#>  Corporate  :   0   1st Qu.:   17.05  
#>  Home Office:1783   Median :   52.44  
#>                     Mean   :  240.97  
#>                     3rd Qu.:  210.91  
#>                     Max.   :22638.48

boxplot(retail_df[retail_df$Segment=='Home Office',][,c('Sales')])

hist(retail_df[retail_df$Segment=='Home Office',][,c('Sales')],col = 'green')

Standard Deviation (SD) profit for each segment category:

aggregate(Sales~Segment,retail_df,sd)

Variance profit for each segment category:

aggregate(Sales~Segment,retail_df,var)

Insights:
1. Sales data distribution for each segment is skewed to the right with most of data located in 0-1000 class.
2. The biggest sales happened in Home Office segment (22,638.48).
3. Home Office segment has larger value of standard deviation and variance. The company need to look more detail of reason on this kind of fluctuation.

3.3 EDA Category vs Profit

Statistic for Furniture Category:

summary(retail_df[retail_df$Category=='Furniture',][,c('Category','Profit')])

#>             Category        Profit         
#>  Furniture      :2121   Min.   :-1862.312  
#>  Office Supplies:   0   1st Qu.:  -12.849  
#>  Technology     :   0   Median :    7.775  
#>                         Mean   :    8.699  
#>                         3rd Qu.:   33.727  
#>                         Max.   : 1013.127

boxplot(retail_df[retail_df$Category=='Furniture',][,c('Profit')])

hist(retail_df[retail_df$Category=='Furniture',][,c('Profit')],col = 'green')

Statistic for Office Supplies Category:

summary(retail_df[retail_df$Category=='Office Supplies',][,c('Category','Profit')])

#>             Category        Profit         
#>  Furniture      :   0   Min.   :-3701.893  
#>  Office Supplies:6026   1st Qu.:    2.102  
#>  Technology     :   0   Median :    6.882  
#>                         Mean   :   20.327  
#>                         3rd Qu.:   19.416  
#>                         Max.   : 4946.370

boxplot(retail_df[retail_df$Category=='Office Supplies',][,c('Profit')])

hist(retail_df[retail_df$Category=='Office Supplies',][,c('Profit')],col = 'green')

Statistic for Technology Category:

summary(retail_df[retail_df$Category=='Technology',][,c('Category','Profit')])

#>             Category        Profit         
#>  Furniture      :   0   Min.   :-6599.978  
#>  Office Supplies:   0   1st Qu.:    5.204  
#>  Technology     :1847   Median :   25.018  
#>                         Mean   :   78.752  
#>                         3rd Qu.:   74.895  
#>                         Max.   : 8399.976

boxplot(retail_df[retail_df$Category=='Technology',][,c('Profit')])

hist(retail_df[retail_df$Category=='Technology',][,c('Profit')],col = 'green')

Data Population from positive profit row for each category:

aggregate(Profit~Category,retail_df[(retail_df$Profit>=0),],length)

Data Population from negative profit (loss) row for each category:

aggregate(Profit~Category,retail_df[(retail_df$Profit<0),],length)

Percentage of negative profit from sales of category:

aggregate(Profit~Category,retail_df[(retail_df$Profit<0),],length)$Profit/(aggregate(Profit~Category,retail_df[(retail_df$Profit<0),],length)$Profit+aggregate(Profit~Category,retail_df[(retail_df$Profit>=0),],length)$Profit)

#> [1] 0.3366337 0.1470295 0.1467244

Standard Deviation (SD) profit for each Category:

aggregate(Profit~Category,retail_df,sd)

Variance profit for each Category:

aggregate(Profit~Category,retail_df,var)

Insights:
1. All of category have similar profit distribution profile with many outlier data in both upper and lower section.
2. Most of profit data population located in between -500 & 500 for every category of product.
3. Furniture Category has the largest percentage of loss (~33.66%).
4. The largest profit transaction was coming from Technology Category(8,399.976) while the largest losss transaction also coming from Technology Category (-6.599.978) 5. Technology Category has the largest value of standard deviation and variance. The company need to look more detail of reason on this kind of fluctuation.

3.4 EDA Category vs Sales

Statistic for Furniture Category:

summary(retail_df[retail_df$Category=='Furniture',][,c('Category','Sales')])

#>             Category        Sales         
#>  Furniture      :2121   Min.   :   1.892  
#>  Office Supplies:   0   1st Qu.:  47.040  
#>  Technology     :   0   Median : 182.220  
#>                         Mean   : 349.835  
#>                         3rd Qu.: 435.168  
#>                         Max.   :4416.174

boxplot(retail_df[retail_df$Category=='Furniture',][,c('Sales')])

hist(retail_df[retail_df$Category=='Furniture',][,c('Sales')],col = 'green')

Statistic for Office Supplies Category:

summary(retail_df[retail_df$Category=='Office Supplies',][,c('Category','Sales')])

#>             Category        Sales         
#>  Furniture      :   0   Min.   :   0.444  
#>  Office Supplies:6026   1st Qu.:  11.760  
#>  Technology     :   0   Median :  27.418  
#>                         Mean   : 119.324  
#>                         3rd Qu.:  79.920  
#>                         Max.   :9892.740

boxplot(retail_df[retail_df$Category=='Office Supplies',][,c('Sales')])

hist(retail_df[retail_df$Category=='Office Supplies',][,c('Sales')],col = 'green')

Statistic for Technology Category:

summary(retail_df[retail_df$Category=='Technology',][,c('Category','Sales')])

#>             Category        Sales         
#>  Furniture      :   0   Min.   :    0.99  
#>  Office Supplies:   0   1st Qu.:   68.02  
#>  Technology     :1847   Median :  166.16  
#>                         Mean   :  452.71  
#>                         3rd Qu.:  448.53  
#>                         Max.   :22638.48

boxplot(retail_df[retail_df$Category=='Technology',][,c('Sales')])

hist(retail_df[retail_df$Category=='Technology',][,c('Sales')],col = 'green')

Standard Deviation (SD) profit for each category:

aggregate(Sales~Category,retail_df,sd)

Variance profit for each category:

aggregate(Sales~Category,retail_df,var)

Insights:
1. Sales data distribution for each Category is skewed to the right with most of data located in 0-1000 class.
2. The biggest sales happened in Technology Category (22,638.48).
3. Technology Category has larger value of standard deviation and variance of sales. The company need to look more detail of reason on this kind of fluctuation.

3.5 Plot Analysis

plot(retail_df$Sales, retail_df$Profit,
     xlab = "Sales",
     ylab = "Profit",
     col = retail_df$Category,
     pch = 18) # mengganti bentuk titik scatterplot
abline(lm(retail_df$Profit ~ retail_df$Sales), col = "red")
legend("topright", legend = levels(retail_df$Category),
       fill = 1:3) # berapa banyak warna yang ditampilkan
title("Correlation Between Sales and Profit")

Based on the plot above, sales and profit has positive correlation value. Furniture Category has maximum range of sales ~5000 and has tendency to create less profit compare to other category.

library(ggplot2)

ggplot(data = aggregate(Quantity~Category,retail_df,sum), mapping = aes(x=  Category, y= Quantity )) +
  geom_col() +# untuk membuat barplot
  labs(
    title="Total Unit Sold For Each Category")+
  theme(plot.title = element_text(hjust=0.5))

Total Unit Sold comes from Office Supplies Category.

ggplot(data = aggregate(Quantity~Segment,retail_df,sum), mapping = aes(x=  Segment, y= Quantity )) +
  geom_col() + # untuk membuat barplot
  labs(
    title="Total Unit Sold For Each Segment")+
  theme(plot.title = element_text(hjust=0.5))

Meanwhile when we look at the Most unit sold from segment, the award goes to consumer segment.

4. Data Manipulation & Transformation

Which category gives the lowest sales number? how much profit?

retail_df[retail_df$Sales == 0.444,]

Answer: Sales 0.444 comes from segment ‘consumer’, office supplies, and got loss 1.11.

Which segment buying the most office supplies?

sales_office_supplies <- retail_df[retail_df$Category=='Office Supplies',]
round(prop.table(table(sales_office_supplies$Segment))*100,2)

#> 
#>    Consumer   Corporate Home Office 
#>       51.89       30.20       17.91

Answer: Consumer Segment = 51.89%

How many times buyers get the highest discount 80% and when it happened

discount<-retail_df[retail_df$Discount==0.8,]
(sort(table(discount$Order.Date),decreasing = T)[1])

#> 2014-07-26 
#>          4

Answer: buyers get 80% discount 4 times and happened on 26 July 2014

Related to number 3, who was the buyers?

discount2<-discount[discount$Order.Date=="2014-07-26",]
(table(discount2$Segment))

#> 
#>    Consumer   Corporate Home Office 
#>           4           0           0

Answer: Buyers were coming from Consumer Segment.

How much total of quantity order from each segment and category, and which is the highest?

xtabs(Quantity~Segment+Category,retail_df)

#>              Category
#> Segment       Furniture Office Supplies Technology
#>   Consumer         4166           11758       3597
#>   Corporate        2495            7018       2095
#>   Home Office      1367            4130       1247

plot(xtabs(Quantity~Segment+Category,retail_df))

Answer: Based on the output, consumer segment with office supply category make the highest quantity order (11,758 units)

Median of quantity for every segment and category?

aggregate(Quantity~Segment+Category,retail_df,median)

Answer: Median of quantity for every segment and category = 3

How much Total quantity order since the company start selling the product?

sum(retail_df$Quantity)

#> [1] 37873

Answer: 37,873 units

How much profit in 4 years?

sum(retail_df$Profit)

#> [1] 286397

sum(retail_df$Sales)

#> [1] 2297201

Answer: 286,397

Discount Percentage (0%)(0-20%)(>20%)

zero_disc<- retail_df[retail_df$Discount==0,]
knitr::kable(table(zero_disc$Discount))

Var1	Freq
0	4798

more_disc <- retail_df[retail_df$Discount>0 & retail_df$Discount<=0.8,]
knitr::kable(table(more_disc$Discount))

Var1	Freq
0.1	94
0.15	52
0.2	3657
0.3	227
0.32	27
0.4	206
0.45	11
0.5	66
0.6	138
0.7	418
0.8	300

Answer: According to the output above, almost half amount of transaction was done without any discount given by the company

Category which generate the highest profit?

xtabs(Profit~ Category, retail_df)

#> Category
#>       Furniture Office Supplies      Technology 
#>        18451.27       122490.80       145454.95

Answer: Technology Category

Highest Loss in which transaction?

retail_df[retail_df$Profit==-6599.978,]

Answer: The transaction happened on 25 Nov 2016 with Order.ID CA-2016-108196. The company sold 5 units product of Technology Category with 70% discount

Highest Profit in Segment?

xtabs(Profit~ Segment, retail_df)

#> Segment
#>    Consumer   Corporate Home Office 
#>   134119.21    91979.13    60298.68

Answer: Consumer Segment

Profit based on Ship Mode + Segment?

xtabs(Profit~Segment+Ship.Mode,retail_df)

#>              Ship.Mode
#> Segment       First Class  Same Day Second Class Standard Class
#>   Consumer      21374.044  9874.205    24946.911      77924.049
#>   Corporate     14464.472  1818.142    18225.713      57470.807
#>   Home Office   13131.324  4199.412    14274.011      28693.932

plot(xtabs(Profit~Segment+Ship.Mode,retail_df))

Answer: Consumer Segment with Standard Class Ship Mode.

The highest quantity seller based on Category

xtabs(Quantity~Category,retail_df)

#> Category
#>       Furniture Office Supplies      Technology 
#>            8028           22906            6939

Answer: Office Supply Category

5. Explanatory Text & Business Recommendation

In 4 years, the company have generate 9994 transaction with 37,873 units sold, total sales 2,297,201, and total profit 286,397. The company sells 3 main category of products: Furniture, Office Supplies, and Technology. Sales distribution plot for each segment and category are skewed to the right. Technology category generates the highest profit (145,454.95) but also recorded the highest transaction loss with amount 6,599.987. Technology Category sales and profit have the largest fluctuation and we suggest for the company to investigate in details the reason of this fluctuation. Our suspect is due to bigger discount applied in Technology Category since the largest discount was all applied in Technology Category transactions. Maybe the company can reduce the discount in order to make this category more stable. Hopefully the company can create solution to reduce the fluctuation even though Technology Category still sit at the peak of company selling price. The company also need to come up with new strategy for Furniture Category since Furniture has the largest loss percentage compare to other category. The buyers mostly come to the company to buy office supply. Based on the analysis, office supply category unit sold is 2-3 times higher from other company. Make sure the warehouse space and supply can accommodate flow of product.

Buyer who bought the company product was coming from 3 segments: Consumer, Corporate, and Home Office. Consumer segment has the highest quantity of product sold and creates the highest amount of profit (134,119.21). Shipping option consist of 4 types : First Class, Same day, Second class, and Standart class. Based on data, from all segment, most of them are choosing Standart Class as their shipping option. Consumer segment is the most highest segment who choosing Standart class mode.

A work by Ade Anggi N S

adeanggins@gmail.com