Data Background

This data contains sales records including discounts, profits of a retail company that sells Office Supplies, Furniture and Technology goods. The buyers come from various segments such as Corporate, Home Office to individual consumers. This company can also provide the best option for your shipment that suits your needs.

Import Data .csv

We will analyze the retail.csv data contained in the data_input folder. Use the read.csv() function to read the CSV file to R.

retail <- read.csv(file="data_input/retail.csv")
retail

Inspect Data

after we have successfully imported our data, we will do a data inspection to find out contents our data, actually we can use the view() function to view the contents of the data but it will take time to see the whole data so we use a function that sees the head() and tail() data’s only.

head(retail)

tail(retail)

dim(retail)

#> [1] 9994   15

The functions anteNA() and is.na() are used to find out whether there is a missing value in the data, functions that sound simple but are very crucial if they are not carried out because they will affect our work process in analysis, for example during calculations.

anyNA(retail)

#> [1] FALSE

colSums(is.na(retail))

#>       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
#>            0            0            0            0            0            0 
#>      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
#>            0            0            0            0            0            0 
#>     Quantity     Discount       Profit 
#>            0            0            0

From the results of our inspections, we get some information that can support our analysis, as follows:

data from each column in retail.csv.
date format that is on the data.
the number of columns 15 and rows 9994.
From the data we inspected, there is no missing value in each column.

Data Cleansing

Data cleansing is done to make the data we have appropriate the data we need for analysis purposes.

str(retail)

#> 'data.frame':    9994 obs. of  15 variables:
#>  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#>  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#>  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#>  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

From this result, we find some of data type not in the corect type. we need to convert it into corect type (data corection)

retail$Row.ID <- as.character(retail$Row.ID)
retail$Segment <- as.factor(retail$Segment)
retail$Ship.Mode <- as.factor(retail$Ship.Mode)
retail$Category <- as.factor(retail$Category)
retail$Sub.Category <- as.factor(retail$Sub.Category)

str(retail)

#> 'data.frame':    9994 obs. of  15 variables:
#>  $ Row.ID      : chr  "1" "2" "3" "4" ...
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#>  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#>  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#>  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

there are still inaccurate data types, namely Order.Date and Ship.Date, we will change it by using the lubridate library.

library(lubridate)

retail$Order.Date <- mdy(retail$Order.Date)

retail$Ship.Date <- mdy(retail$Ship.Date)

str(retail)

#> 'data.frame':    9994 obs. of  15 variables:
#>  $ Row.ID      : chr  "1" "2" "3" "4" ...
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
#>  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
#>  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#>  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Now, Each of column already changed into desired data type dataset is ready to be processed and analyzed.

Data Explanation

summary(retail)

#>     Row.ID            Order.ID           Order.Date        
#>  Length:9994        Length:9994        Min.   :2014-01-03  
#>  Class :character   Class :character   1st Qu.:2015-05-23  
#>  Mode  :character   Mode  :character   Median :2016-06-26  
#>                                        Mean   :2016-04-30  
#>                                        3rd Qu.:2017-05-14  
#>                                        Max.   :2017-12-30  
#>                                                            
#>    Ship.Date                   Ship.Mode    Customer.ID       
#>  Min.   :2014-01-07   First Class   :1538   Length:9994       
#>  1st Qu.:2015-05-27   Same Day      : 543   Class :character  
#>  Median :2016-06-29   Second Class  :1945   Mode  :character  
#>  Mean   :2016-05-03   Standard Class:5968                     
#>  3rd Qu.:2017-05-18                                           
#>  Max.   :2018-01-05                                           
#>                                                               
#>         Segment      Product.ID                   Category   
#>  Consumer   :5191   Length:9994        Furniture      :2121  
#>  Corporate  :3020   Class :character   Office Supplies:6026  
#>  Home Office:1783   Mode  :character   Technology     :1847  
#>                                                              
#>                                                              
#>                                                              
#>                                                              
#>       Sub.Category  Product.Name           Sales              Quantity    
#>  Binders    :1523   Length:9994        Min.   :    0.444   Min.   : 1.00  
#>  Paper      :1370   Class :character   1st Qu.:   17.280   1st Qu.: 2.00  
#>  Furnishings: 957   Mode  :character   Median :   54.490   Median : 3.00  
#>  Phones     : 889                      Mean   :  229.858   Mean   : 3.79  
#>  Storage    : 846                      3rd Qu.:  209.940   3rd Qu.: 5.00  
#>  Art        : 796                      Max.   :22638.480   Max.   :14.00  
#>  (Other)    :3613                                                         
#>     Discount          Profit         
#>  Min.   :0.0000   Min.   :-6599.978  
#>  1st Qu.:0.0000   1st Qu.:    1.729  
#>  Median :0.2000   Median :    8.666  
#>  Mean   :0.1562   Mean   :   28.657  
#>  3rd Qu.:0.2000   3rd Qu.:   29.364  
#>  Max.   :0.8000   Max.   : 8399.976  
#>

Summary :

First order occured in Jan 2014
Standard Class was the most popular one for shipping option and Same day delivery was the lowest one
Our Retail’s has the most buyers coming from cunsumer segment and home office segment was the lowest
Office Supply category was the best seller compare to others
In Sub category, Other stuff was the most favorite one then followed by Binders and paper. Art is the lowest and followed by Phones and Storage.
Our retail’s gained the average sales at 229.858; with max sales value at 22638.480 and minumum sales value at 0.444
Max quantity order was 14 within 4 years but average at 3.79
Our retail’s amount of discount was 80%, but mean number of discount around 15%
Our retail’s had loss at 6599 and gained 8399 as their profit with average profit at 28.657

Check the Outlier within profit

boxplot(retail$Profit)

From result above, we find posibilities for the outliers, but we need to see or calculate the sd value so that we can tolerate our data or not.

aggregate(Profit~Segment,retail,sd)

From the result our Sd all segment is around 200 my oppinion its still be tolerated, so the process may continue.

Data Manipulation and Business Inqueries

our company wants to evaluate sales results by segment to plan in the following period in order to increase our sales

Which segment gives the lowest and the highest sales number and how much profit?

sales_agg <- aggregate(formula=cbind(Sales,Profit)~Segment, data=retail, FUN=sum)

sales_agg[order(sales_agg$Sales, decreasing = T),]

The aggregation results show that our sales and profits for 4 years are quite good in the Consumer segment compared to other segments during these 4 years.

how is the frequency of category purchases in each segment?

freq_retail <- table(retail$Category, retail$Segment)
freq_retail

#>                  
#>                   Consumer Corporate Home Office
#>   Furniture           1113       646         362
#>   Office Supplies     3127      1820        1079
#>   Technology           951       554         342

Category Office Supplies were the most purchased in each segment, indicating that Office Supplies were the most needed or in demand during the past 4 years.

referring to the statement above, what subcategories were purchased in the Office Supplies category along with the sales sales quantity?

office_supplies <- retail[retail$Category=="Office Supplies",]
qty_officesupplies <- aggregate(formula=Quantity~Sub.Category, data=office_supplies, FUN=sum)

qty_officesupplies[order(qty_officesupplies$Quantity, decreasing = T),]

from our calculations on the category office supplies subcategory the most interested are Binders and the least interested or purchased are Supplies.

refers to the calculation of number 2 the technology category has the lowest value, sort the sales qty from the smallest to the largest

Technology <- retail[retail$Category=="Technology",]
qty_techonlogy <- aggregate(formula=Quantity~Sub.Category, data=Technology, FUN=sum)

qty_techonlogy[order(qty_techonlogy$Quantity, decreasing = F),]

These results show that the lowest sales in technology are Copiers, in my opinion, this is because photocopy machines in this era have slightly reduced their use, where currently we use digital data more often that can be directly duplicated digitally, and there are several the movement to reduce paper use made hardcopy duplication very rare.

which year did the highest order quantity occur?

retail$year_order <- format(retail$Order.Date,"%Y")
retail$yearmonth_order <- format(retail$Order.Date,"%Y-%m")

head(retail)

year_tot <- aggregate(formula=Quantity~year_order, data=retail, FUN=sum)
year_tot[order(year_tot$Quantity, decreasing = T),]

In 2017 there was a rapid increase compared to the previous year, from this maybe we can replicate what we did in 2017 the next period.

Average profit every year_order

year_avg <- aggregate(formula=Profit~year_order, data=retail, FUN=mean)
year_avg[order(year_avg$Profit, decreasing = T),]

despite in 2017 there was the highest order quantity, but if seen from the average profit in 2017 it was only in the 3rd position, my assumption this is because there is an outlier profit in 2017.

Show Trend profit Month by Month at 2017 to see are there outlier profit in 2017

library(ggplot2)
month_avg <- aggregate(formula=Profit~yearmonth_order, data=retail[retail$year_order == 2017,], FUN=sum)
ggplot(month_avg, aes(x = yearmonth_order, y = Profit)) +
  geom_line(group = 1) + 
  geom_point()

Referring to statement number 5, it was proven that there was a high spike in certain months that occurred in 2017-03, this happened because there were certain events in that month that caused high demand.

We want to evaluate the services we provide in terms of shipping goods, what is the median delivery time based on Ship Mode?

shipping <- aggregate(formula=Order.ID~Ship.Mode, data=retail, FUN=length)
shipping

Standard class is the customer’s favorite shipping class, in my opinion it’s because price given is cheaper than other shipping modes.

retail$Duration <- difftime(time1=retail$Ship.Date, time2 = retail$Order.Date, units = "days")
shipping_duration <- aggregate(formula=Duration~Ship.Mode, data=retail, FUN=median)
shipping_duration

We will use this median value as a standard reference for shipping each ship mode

How many customers use the First Class ship mode whose delivery is below the standard we set?

nrow(retail[retail$Ship.Mode =="First Class" & retail$Duration > 2,c("Ship.Mode","Duration")])

#> [1] 624

There are still 624 orders that are still below the time we specify.

How many customers use the Same Day ship mode whose delivery is below the standard we set?

nrow(retail[retail$Ship.Mode =="Same Day" & retail$Duration > 0,c("Ship.Mode","Duration")])

#> [1] 24

There are still 24 orders that are still below the time we specify.

How many customers use the Second Class ship mode whose delivery is below the standard we set?

nrow(retail[retail$Ship.Mode =="Second Class" & retail$Duration > 3,c("Ship.Mode","Duration")])

#> [1] 800

There are still 800 orders that are still below the time we specify.

How many customers use the Standard Class ship mode whose delivery is below the standard we set?

nrow(retail[retail$Ship.Mode =="Standard Class" & retail$Duration > 5,c("Ship.Mode","Duration")])

#> [1] 1824

There are still 1824 orders that are still below the time we specify.

**How much Total quantity order since our retail start selling the product?*

sum(retail$Quantity)

#> [1] 37873

Answer : 37873 items

**How much Total Profit order since our retail start selling the product?*

sum(retail$Profit)

#> [1] 286397

Answer : 286397 items

how much total profit and qty sold when there is a discount for each year?

discount <- aggregate(formula=cbind(Quantity,Profit)~year_order, data=retail[retail$Discount>0,], FUN=sum)
discount

From this result, the discount that we give to consumers is less effective because if we look at the total profit, we see a loss every year.

how much total profit and qty sold when there is no discount for each year?

no_discount <- aggregate(formula=cbind(Quantity,Profit)~year_order, data=retail[retail$Discount==0,], FUN=sum)
no_discount

refers to statement no.14 is proven that the discount that we make is less effective if we look at the total quantity sold that is not much different, even though we are profitable in the absence of a discount, we actually generate a large profit.

Conclusion

In 4 years, since bigining 2014-2017 Our Retail has sold a total of 37873 items with a total profit of 286397. The company sells 3 main categories: Furniture, Office Supplies and Technology.

of all categories Office Supplies provide the biggest profit from each segment, the most desirable sub profit is Binders. Unfortunately, in the Technology category, sales are the smallest in each segment, in my opinion, this could be due to the nature of our consumers who buy technology for long-term use, and because in the Copiers subcategory their use has decreased slightly due to the shift in digital document duplication.

In 2017 there was a rapid increase sales compared to the previous year, despite in 2017 there was the highest order quantity, but if seen from the average profit in 2017 it was only in the 3rd position, my assumption this is because there is an outlier profit in 2017 or certain events in that month that caused high demand.

Standard class is the customer’s favorite shipping class, in my opinion it’s because price given is cheaper than other shipping modes. Unfortunately, there are still many delays in each shipping method that we provide if we refer to the standards that we set on above.

If we look at the results of our calculations, discounts are not very effective because the total number of sales with or without discounts does not have a significant difference, it actually gives a loss to our company.

Recomendation

Need to increase market target at Corporate segment and home office segment, we may give them more discount with taking back the profit to company
Make sure all category was safely supplied on the wareshouse specially for ffice supply items.
In terms of improving the quality of service, we must deliver the ordered goods according to the standards we just made:

First Class 2 days
Same Day 0 days
Second Class 3 days
Standard Class 5 days

Programming for Data Science

Ruli Erinton

23 March 2022