Today, we will be analyzing the retail dataset from PT. Elips, a global retail company that sells office supplies, furniture, and technology items. We will be exploring the data to gain insights into the company’s sales performance, profit and loss, and customer preferences across different segments. We hope you enjoy the analysis and find the insights useful in informing future business decisions.
To begin our exploratory data analysis of the retail dataset, we need to load the data into R. The dataset is stored in a CSV file named ‘retail.csv’, which we have saved in a folder called ‘data_input’. To load the dataset into R, we have used the ‘read.csv’ function, which reads the CSV file and stores the data in a data frame.
<-read.csv("data_input/retail.csv") elips
In our code, we have created a new data frame called ‘elips’ and used the ‘read.csv’ function to read in the ‘retail.csv’ file from the ‘data_input’ folder. This folder path may vary depending on where you have saved your data file. With the retail data now loaded into R, we can proceed with our exploratory analysis to gain insights into sales performance and customer preferences.
After loading the retail dataset into R, we’ll start by inspecting the data to get a better understanding of its structure and content. To do this, we’ve used several functions in R.
Firstly, we used the ‘head’ function to display the first six rows of the data frame ‘elips’, which gives us a quick glimpse of the data.
head(elips)
Next, we used the ‘tail’ function to display the last six rows of the data frame, which is useful for checking if the data has been read in correctly
tail(elips)
We then used the ‘dim’ function to show the dimensions of the data frame, which tells us how many rows and columns are in the data.
dim(elips)
#> [1] 9994 15
Finally, we used the ‘names’ function to display the column names of the data frame, which is helpful for identifying the variables that are present in the dataset.
names(elips)
#> [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
#> [6] "Customer.ID" "Segment" "Product.ID" "Category" "Sub.Category"
#> [11] "Product.Name" "Sales" "Quantity" "Discount" "Profit"
Through our inspection of the ‘elips’ retail dataset, we can gather some important information about the data structure and content.
elips data contain 9994 of rows and 15 of coloumns
Each of column name : “Row.ID”, “Order.ID”, “Order.Date”, “Ship.Date”, “Ship.Mode”, “Customer.ID”, “Segment”,“Product.ID”, “Category”, “Sub.Category”, “Product.Name”, “Sales”, “Quantity”, “Discount”, “Profit”
After inspecting the ‘elips’ retail dataset, we used the ‘str’ function to check the data types of each column.
str(elips)
#> 'data.frame': 9994 obs. of 15 variables:
#> $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#> $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#> $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
We found that some columns are not in the correct data type, and we need to convert them to the correct data type using data coercion.
$Row.ID <- as.character(elips$Row.ID)
elips$Order.ID<- as.character (elips$Order.ID)
elips$Order.Date<- as.Date(elips$Order.Date, "%m/%d/%y")
elips$Ship.Date <- as.Date(elips$Ship.Date, "%m/%d/%y")
elips$Customer.ID<-as.character(elips$Customer.ID)
elips$Product.ID<- as.character(elips$Product.ID)
elips$Product.Name<-as.character(elips$Product.Name)
elips
str(elips)
#> 'data.frame': 9994 obs. of 15 variables:
#> $ Row.ID : chr "1" "2" "3" "4" ...
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
#> $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#> $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
After performing data coercion on the ‘elips’ retail dataset, we successfully converted each column to the desired data type.
To ensure data quality, we need to check if the ‘elips’ retail dataset contains any missing values.
colSums(is.na(elips))
#> Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
#> 0 0 0 0 0 0
#> Segment Product.ID Category Sub.Category Product.Name Sales
#> 0 0 0 0 0 0
#> Quantity Discount Profit
#> 0 0 0
anyNA(elips)
#> [1] FALSE
We are glad to report that there are no missing values present in the dataset. This means that our analysis can proceed with complete data.
We used subsetting to remove columns 1 and 2 from the ‘elips’ retail dataset because they are not needed for our analysis. We saved the subsetted dataset as the ‘elips_clean’ variable. This variable now contains the relevant columns that we will use for our exploratory data analysis.
<- elips[,c(3:15)]
elips_cleanhead(elips_clean)
Now that we have completed the data cleansing and subsetting steps, the ‘Elips’ dataset is ready to be processed and analyzed.
We used the summary(elips_clean) function to obtain summary statistics for the ‘Elips’ retail dataset.
summary(elips_clean)
#> Order.Date Ship.Date Ship.Mode
#> Min. :2014-01-03 Min. :2014-01-07 Length:9994
#> 1st Qu.:2015-05-23 1st Qu.:2015-05-27 Class :character
#> Median :2016-06-26 Median :2016-06-29 Mode :character
#> Mean :2016-04-30 Mean :2016-05-03
#> 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
#> Max. :2017-12-30 Max. :2018-01-05
#> Customer.ID Segment Product.ID Category
#> Length:9994 Length:9994 Length:9994 Length:9994
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#> Sub.Category Product.Name Sales Quantity
#> Length:9994 Length:9994 Min. : 0.444 Min. : 1.00
#> Class :character Class :character 1st Qu.: 17.280 1st Qu.: 2.00
#> Mode :character Mode :character Median : 54.490 Median : 3.00
#> Mean : 229.858 Mean : 3.79
#> 3rd Qu.: 209.940 3rd Qu.: 5.00
#> Max. :22638.480 Max. :14.00
#> Discount Profit
#> Min. :0.0000 Min. :-6599.978
#> 1st Qu.:0.0000 1st Qu.: 1.729
#> Median :0.2000 Median : 8.666
#> Mean :0.1562 Mean : 28.657
#> 3rd Qu.:0.2000 3rd Qu.: 29.364
#> Max. :0.8000 Max. : 8399.976
The summary of Elips dataset shows several insights.
The initial occurrence of the first order was in January 2014.
Standard Class was the most favored shipping option, while Same-day delivery was the least popular.
PT. Elips has the highest number of buyers from the consumer segment, and the home office segment has the lowest.
The Office Supply category was the top-performing category compared to others.
In terms of subcategories, Other stuff was the most preferred, followed by Binders and Paper. Art had the lowest preference.
PT. Elips achieved an average sales value of 229.858, with the highest sales value at 22638.480 and the lowest at 0.444.
The maximum quantity ordered within four years was 14, with an average of 3.79.
The maximum discount offered by PT. Elips was 80%, but the average discount offered was around 15%.
PT. Elips experienced a loss of 6599 and a profit of 8399, with an average profit of 28.657.
As a preliminary step in our exploratory data analysis, we checked for any outliers in the ‘Profit’ column of the ‘Elips’ dataset. Outliers are data points that deviate significantly from the rest of the distribution and can skew our analysis results.
aggregate(Profit~Segment,elips_clean,mean)
aggregate(Profit~Segment,elips_clean,var)
aggregate(Profit~Segment,elips_clean,sd)
boxplot(elips_clean$Profit)
After checking for potential outliers in the ‘Profit’ column of the ‘Elips’ dataset, we found several data points that fell outside the whiskers of the boxplot, indicating possible outliers. However, upon further investigation, we observed that the standard deviation value for the ‘Profit’ column was around 200, which is still within an acceptable range for the dataset.
$Sales == 0.444,] elips[elips
The category with the lowest sales figure is “Office Supplies,” with a sales value of 0.444 coming from the “Consumer” segment. This category resulted in a loss of 1.11.
<- elips_clean[elips_clean$Category=="Office Supplies",]
os round(prop.table(table(os$Segment))*100,2)
#>
#> Consumer Corporate Home Office
#> 51.89 30.20 17.91
The Consumer segment has the highest percentage at 51.89%.
<-elips_clean[elips_clean$Discount==0.8,]
dissort(table(dis$Order.Date),decreasing = T)[1]) (
#> 2014-07-26
#> 4
The maximum discount of 80% occurred in July 2014, and it happened four times.
<-dis[dis$Order.Date=="2014-07-26",]
dis2table(dis2$Segment)) (
#>
#> Consumer
#> 4
The entire buyers are from the Consumer segment.
xtabs(Quantity~Segment+Category,elips_clean)
#> Category
#> Segment Furniture Office Supplies Technology
#> Consumer 4166 11758 3597
#> Corporate 2495 7018 2095
#> Home Office 1367 4130 1247
plot(xtabs(Quantity~Segment+Category,elips_clean))
heatmap(xtabs(Quantity~Segment+Category, elips_clean),
cexRow = 0.8, cexCol = 0.8, scale = "column", Colv = NA, Rowv = NA)
Based on the previously mentioned results, the Consumer segment with the Office Supply category has the highest quantity ordered with a total of 11,758.
aggregate(Quantity~Segment+Category,elips_clean,median)
The median quantity for each segment and category is 3.
sum(elips_clean$Quantity)
#> [1] 37873
The total quantity ordered since PT. Elips started selling the product is 37,873 items.
sum(elips_clean$Profit)
#> [1] 286397
The total profit generated is 286397.
<- elips_clean[elips_clean$Discount==0,]
nol::kable(table(nol$Discount)) knitr
Var1 | Freq |
---|---|
0 | 4798 |
<- elips_clean[elips_clean$Discount>0 & elips_clean$Discount<=0.8,]
more0 ::kable(table(more0$Discount)) knitr
Var1 | Freq |
---|---|
0.1 | 94 |
0.15 | 52 |
0.2 | 3657 |
0.3 | 227 |
0.32 | 27 |
0.4 | 206 |
0.45 | 11 |
0.5 | 66 |
0.6 | 138 |
0.7 | 418 |
0.8 | 300 |
Based on this data, 50% of items were sold without any discount given.
xtabs(Profit~ Category, elips_clean)
#> Category
#> Furniture Office Supplies Technology
#> 18451.27 122490.80 145454.95
The Technology category generates the highest profit.
$Profit==-6599.978,] elips[elips
The highest loss occurred in 2016 when five machine items were sold at a 70% discount.
xtabs(Profit~ Segment, elips_clean)
#> Segment
#> Consumer Corporate Home Office
#> 134119.21 91979.13 60298.68
The Consumer segment had the highest profit compared to other segments.
xtabs(Sales~ Category, elips_clean)
#> Category
#> Furniture Office Supplies Technology
#> 741999.8 719047.0 836154.0
plot(xtabs(Profit~Segment+Ship.Mode,elips_clean))
The Technology category has the highest sales figure compared to others.
xtabs(Profit~Segment+Ship.Mode,elips_clean)
#> Ship.Mode
#> Segment First Class Same Day Second Class Standard Class
#> Consumer 21374.044 9874.205 24946.911 77924.049
#> Corporate 14464.472 1818.142 18225.713 57470.807
#> Home Office 13131.324 4199.412 14274.011 28693.932
Most buyers chose the Standard Class for shipping options, with the Consumer segment having the highest number of purchases.
xtabs(Quantity~Category,elips_clean)
#> Category
#> Furniture Office Supplies Technology
#> 8028 22906 6939
The Office Supply category has the highest quantity sold.
PT. Elips has been selling products for 4 years, from 2014 to 2017, and during this period, they have sold a total of 37,873 items with a total profit of 286,397. The company sells three main categories of products, namely Furniture, Office Supplies, and Technology. The ‘Technology’ category has generated the highest profit of 145,454.95 compared to the other categories. However, the category that also experienced the highest loss, amounting to 6,599.987, was also ‘Technology’ in 2016. It is worth noting that the ‘Technology’ category has the highest selling price among all categories.
Buyers who have purchased the products are segmented into three categories, namely Consumer, Corporate, and Home Office. Among them, the ‘Consumer’ segment has provided the highest amount of profit compared to others, with a value of 134,119.21. There are four shipping options available, namely First Class, Same Day, Second Class, and Standard Class. Based on the data, most buyers across all segments choose Standard Class as their preferred shipping option. Furthermore, the ‘Consumer’ segment is the highest segment that selects Standard Class mode.
Recommendations:
Our calculations indicate that the Technology category has the highest sales value, but it also has the highest loss due to the excessively high discounts (80%) that were given without considering the profit margin. It is important to remember that no discount should be given without ensuring that the company makes a profit. Therefore, it is recommended that discount percentages be calculated more carefully before being implemented.
To expand our market, we should focus on targeting the Corporate segment and Home Office segment by offering them more discounts, while also ensuring that the company’s profits are maintained.
Furthermore, it is imperative to ensure that all categories, particularly Office Supplies, are securely stocked in the warehouse. According to our calculations, buyers mostly purchased Office Supplies with a total of 22,906, which is almost two to three times higher than any other category.