1 Introduction

Today, we will be analyzing the retail dataset from PT. Elips, a global retail company that sells office supplies, furniture, and technology items. We will be exploring the data to gain insights into the company’s sales performance, profit and loss, and customer preferences across different segments. We hope you enjoy the analysis and find the insights useful in informing future business decisions.

2 Data Exploratory

To begin our exploratory data analysis of the retail dataset, we need to load the data into R. The dataset is stored in a CSV file named ‘retail.csv’, which we have saved in a folder called ‘data_input’. To load the dataset into R, we have used the ‘read.csv’ function, which reads the CSV file and stores the data in a data frame.

elips<-read.csv("data_input/retail.csv")

In our code, we have created a new data frame called ‘elips’ and used the ‘read.csv’ function to read in the ‘retail.csv’ file from the ‘data_input’ folder. This folder path may vary depending on where you have saved your data file. With the retail data now loaded into R, we can proceed with our exploratory analysis to gain insights into sales performance and customer preferences.

2.1 Data Inspection

After loading the retail dataset into R, we’ll start by inspecting the data to get a better understanding of its structure and content. To do this, we’ve used several functions in R.

Firstly, we used the ‘head’ function to display the first six rows of the data frame ‘elips’, which gives us a quick glimpse of the data.

head(elips)

Next, we used the ‘tail’ function to display the last six rows of the data frame, which is useful for checking if the data has been read in correctly

tail(elips)

We then used the ‘dim’ function to show the dimensions of the data frame, which tells us how many rows and columns are in the data.

dim(elips)

#> [1] 9994   15

Finally, we used the ‘names’ function to display the column names of the data frame, which is helpful for identifying the variables that are present in the dataset.

names(elips)

#>  [1] "Row.ID"       "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"   
#>  [6] "Customer.ID"  "Segment"      "Product.ID"   "Category"     "Sub.Category"
#> [11] "Product.Name" "Sales"        "Quantity"     "Discount"     "Profit"

Through our inspection of the ‘elips’ retail dataset, we can gather some important information about the data structure and content.

elips data contain 9994 of rows and 15 of coloumns
Each of column name : “Row.ID”, “Order.ID”, “Order.Date”, “Ship.Date”, “Ship.Mode”, “Customer.ID”, “Segment”,“Product.ID”, “Category”, “Sub.Category”, “Product.Name”, “Sales”, “Quantity”, “Discount”, “Profit”

2.2 Data Cleansing & Coertions

After inspecting the ‘elips’ retail dataset, we used the ‘str’ function to check the data types of each column.

str(elips)

#> 'data.frame':    9994 obs. of  15 variables:
#>  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#>  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#>  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#>  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

We found that some columns are not in the correct data type, and we need to convert them to the correct data type using data coercion.

elips$Row.ID <- as.character(elips$Row.ID)
elips$Order.ID<- as.character (elips$Order.ID)
elips$Order.Date<- as.Date(elips$Order.Date, "%m/%d/%y")
elips$Ship.Date <- as.Date(elips$Ship.Date, "%m/%d/%y")
elips$Customer.ID<-as.character(elips$Customer.ID)
elips$Product.ID<- as.character(elips$Product.ID)
elips$Product.Name<-as.character(elips$Product.Name)

str(elips)

#> 'data.frame':    9994 obs. of  15 variables:
#>  $ Row.ID      : chr  "1" "2" "3" "4" ...
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
#>  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
#>  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#>  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

After performing data coercion on the ‘elips’ retail dataset, we successfully converted each column to the desired data type.

To ensure data quality, we need to check if the ‘elips’ retail dataset contains any missing values.

colSums(is.na(elips))

#>       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
#>            0            0            0            0            0            0 
#>      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
#>            0            0            0            0            0            0 
#>     Quantity     Discount       Profit 
#>            0            0            0

anyNA(elips)

#> [1] FALSE

We are glad to report that there are no missing values present in the dataset. This means that our analysis can proceed with complete data.

We used subsetting to remove columns 1 and 2 from the ‘elips’ retail dataset because they are not needed for our analysis. We saved the subsetted dataset as the ‘elips_clean’ variable. This variable now contains the relevant columns that we will use for our exploratory data analysis.

elips_clean<- elips[,c(3:15)]
head(elips_clean)

Now that we have completed the data cleansing and subsetting steps, the ‘Elips’ dataset is ready to be processed and analyzed.

3 Data Explanation

We used the summary(elips_clean) function to obtain summary statistics for the ‘Elips’ retail dataset.

summary(elips_clean)

#>    Order.Date           Ship.Date           Ship.Mode        
#>  Min.   :2014-01-03   Min.   :2014-01-07   Length:9994       
#>  1st Qu.:2015-05-23   1st Qu.:2015-05-27   Class :character  
#>  Median :2016-06-26   Median :2016-06-29   Mode  :character  
#>  Mean   :2016-04-30   Mean   :2016-05-03                     
#>  3rd Qu.:2017-05-14   3rd Qu.:2017-05-18                     
#>  Max.   :2017-12-30   Max.   :2018-01-05                     
#>  Customer.ID          Segment           Product.ID          Category        
#>  Length:9994        Length:9994        Length:9994        Length:9994       
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>  Sub.Category       Product.Name           Sales              Quantity    
#>  Length:9994        Length:9994        Min.   :    0.444   Min.   : 1.00  
#>  Class :character   Class :character   1st Qu.:   17.280   1st Qu.: 2.00  
#>  Mode  :character   Mode  :character   Median :   54.490   Median : 3.00  
#>                                        Mean   :  229.858   Mean   : 3.79  
#>                                        3rd Qu.:  209.940   3rd Qu.: 5.00  
#>                                        Max.   :22638.480   Max.   :14.00  
#>     Discount          Profit         
#>  Min.   :0.0000   Min.   :-6599.978  
#>  1st Qu.:0.0000   1st Qu.:    1.729  
#>  Median :0.2000   Median :    8.666  
#>  Mean   :0.1562   Mean   :   28.657  
#>  3rd Qu.:0.2000   3rd Qu.:   29.364  
#>  Max.   :0.8000   Max.   : 8399.976

The summary of Elips dataset shows several insights.

The initial occurrence of the first order was in January 2014.
Standard Class was the most favored shipping option, while Same-day delivery was the least popular.
PT. Elips has the highest number of buyers from the consumer segment, and the home office segment has the lowest.
The Office Supply category was the top-performing category compared to others.
In terms of subcategories, Other stuff was the most preferred, followed by Binders and Paper. Art had the lowest preference.
PT. Elips achieved an average sales value of 229.858, with the highest sales value at 22638.480 and the lowest at 0.444.
The maximum quantity ordered within four years was 14, with an average of 3.79.
The maximum discount offered by PT. Elips was 80%, but the average discount offered was around 15%.
PT. Elips experienced a loss of 6599 and a profit of 8399, with an average profit of 28.657.

As a preliminary step in our exploratory data analysis, we checked for any outliers in the ‘Profit’ column of the ‘Elips’ dataset. Outliers are data points that deviate significantly from the rest of the distribution and can skew our analysis results.

aggregate(Profit~Segment,elips_clean,mean)

aggregate(Profit~Segment,elips_clean,var)

aggregate(Profit~Segment,elips_clean,sd)

boxplot(elips_clean$Profit)

After checking for potential outliers in the ‘Profit’ column of the ‘Elips’ dataset, we found several data points that fell outside the whiskers of the boxplot, indicating possible outliers. However, upon further investigation, we observed that the standard deviation value for the ‘Profit’ column was around 200, which is still within an acceptable range for the dataset.

4 Data Manipulation & Transformation

Which category has the lowest sales figure, and what is the corresponding profit amount?

elips[elips$Sales == 0.444,]

The category with the lowest sales figure is “Office Supplies,” with a sales value of 0.444 coming from the “Consumer” segment. This category resulted in a loss of 1.11.

What segment has the highest purchasing activity for office supplies?

os <- elips_clean[elips_clean$Category=="Office Supplies",]
round(prop.table(table(os$Segment))*100,2)

#> 
#>    Consumer   Corporate Home Office 
#>       51.89       30.20       17.91

The Consumer segment has the highest percentage at 51.89%.

What is the frequency of buyers receiving the maximum discount of 80%, and when did this occur?

dis<-elips_clean[elips_clean$Discount==0.8,]
(sort(table(dis$Order.Date),decreasing = T)[1])

#> 2014-07-26 
#>          4

The maximum discount of 80% occurred in July 2014, and it happened four times.

Related to number 3, who was the buyers?

dis2<-dis[dis$Order.Date=="2014-07-26",]
(table(dis2$Segment))

#> 
#> Consumer 
#>        4

The entire buyers are from the Consumer segment.

What is the total quantity ordered for each segment and category, and which one has the highest quantity?

xtabs(Quantity~Segment+Category,elips_clean)

#>              Category
#> Segment       Furniture Office Supplies Technology
#>   Consumer         4166           11758       3597
#>   Corporate        2495            7018       2095
#>   Home Office      1367            4130       1247

plot(xtabs(Quantity~Segment+Category,elips_clean))

heatmap(xtabs(Quantity~Segment+Category, elips_clean), 
        cexRow = 0.8, cexCol = 0.8, scale = "column", Colv = NA, Rowv = NA)

Based on the previously mentioned results, the Consumer segment with the Office Supply category has the highest quantity ordered with a total of 11,758.

What is the median quantity for each segment and category?

aggregate(Quantity~Segment+Category,elips_clean,median)

The median quantity for each segment and category is 3.

What is the total quantity ordered since PT. Elips started selling the product?

sum(elips_clean$Quantity)

#> [1] 37873

The total quantity ordered since PT. Elips started selling the product is 37,873 items.

What is the total profit generated in four years?

sum(elips_clean$Profit)

#> [1] 286397

The total profit generated is 286397.

Discount Percentage (0%) (0-20%) (>20%)

nol<- elips_clean[elips_clean$Discount==0,]
knitr::kable(table(nol$Discount))

Var1	Freq
0	4798

more0 <- elips_clean[elips_clean$Discount>0 & elips_clean$Discount<=0.8,]
knitr::kable(table(more0$Discount))

Var1	Freq
0.1	94
0.15	52
0.2	3657
0.3	227
0.32	27
0.4	206
0.45	11
0.5	66
0.6	138
0.7	418
0.8	300

Based on this data, 50% of items were sold without any discount given.

Which category generates the highest profit?

xtabs(Profit~ Category, elips_clean)

#> Category
#>       Furniture Office Supplies      Technology 
#>        18451.27       122490.80       145454.95

The Technology category generates the highest profit.

What was the highest loss in a transaction?

elips[elips$Profit==-6599.978,]

The highest loss occurred in 2016 when five machine items were sold at a 70% discount.

Which segment had the highest profit?

xtabs(Profit~ Segment, elips_clean)

#> Segment
#>    Consumer   Corporate Home Office 
#>   134119.21    91979.13    60298.68

The Consumer segment had the highest profit compared to other segments.

What is the sales figure for each category?

xtabs(Sales~ Category, elips_clean)

#> Category
#>       Furniture Office Supplies      Technology 
#>        741999.8        719047.0        836154.0

plot(xtabs(Profit~Segment+Ship.Mode,elips_clean))

The Technology category has the highest sales figure compared to others.

What is the profit based on ship mode and segment?

xtabs(Profit~Segment+Ship.Mode,elips_clean)

#>              Ship.Mode
#> Segment       First Class  Same Day Second Class Standard Class
#>   Consumer      21374.044  9874.205    24946.911      77924.049
#>   Corporate     14464.472  1818.142    18225.713      57470.807
#>   Home Office   13131.324  4199.412    14274.011      28693.932

Most buyers chose the Standard Class for shipping options, with the Consumer segment having the highest number of purchases.

Which category has the highest quantity sold?

xtabs(Quantity~Category,elips_clean)

#> Category
#>       Furniture Office Supplies      Technology 
#>            8028           22906            6939

The Office Supply category has the highest quantity sold.

5 Explanation & Business Recomendation

PT. Elips has been selling products for 4 years, from 2014 to 2017, and during this period, they have sold a total of 37,873 items with a total profit of 286,397. The company sells three main categories of products, namely Furniture, Office Supplies, and Technology. The ‘Technology’ category has generated the highest profit of 145,454.95 compared to the other categories. However, the category that also experienced the highest loss, amounting to 6,599.987, was also ‘Technology’ in 2016. It is worth noting that the ‘Technology’ category has the highest selling price among all categories.

Buyers who have purchased the products are segmented into three categories, namely Consumer, Corporate, and Home Office. Among them, the ‘Consumer’ segment has provided the highest amount of profit compared to others, with a value of 134,119.21. There are four shipping options available, namely First Class, Same Day, Second Class, and Standard Class. Based on the data, most buyers across all segments choose Standard Class as their preferred shipping option. Furthermore, the ‘Consumer’ segment is the highest segment that selects Standard Class mode.

Recommendations:

Our calculations indicate that the Technology category has the highest sales value, but it also has the highest loss due to the excessively high discounts (80%) that were given without considering the profit margin. It is important to remember that no discount should be given without ensuring that the company makes a profit. Therefore, it is recommended that discount percentages be calculated more carefully before being implemented.
To expand our market, we should focus on targeting the Corporate segment and Home Office segment by offering them more discounts, while also ensuring that the company’s profits are maintained.
Furthermore, it is imperative to ensure that all categories, particularly Office Supplies, are securely stocked in the warehouse. According to our calculations, buyers mostly purchased Office Supplies with a total of 22,906, which is almost two to three times higher than any other category.

Analyzing Sales and Customer Preferences in Retail Dataset

Reyhan Bihaqqi Purnawan

27th of April, 2023