This is an R Markdown document that contains business recomendations based on retail data provided by PT. ABC’s Business Database. This Rmarkdown will describe :
This is retail dataset obtained from PT. ABC’s Business Database. The dataset includes their retail transactions in the past 4 years. Today, successful retail businesses rely on solid facts, metrics, and up-to-date data to make key business decisions. That is the reason why we need to analyze this data. If you want to maximize profits and gain customer loyalty then you must understand your customers better.
First make sure the data we’ll like to work with is also in our current directory, and use the read.csv() to read our csv file into our global environment.
We can use str() to inspect the structure of data. Call str() on our retail dataset. str() returns the structure of an R Object.
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : Factor w/ 5009 levels "CA-2014-100006",..: 2501 2501 2297 4373 4373 202 202 202 202 202 ...
## $ Order.Date : Factor w/ 1237 levels "1/1/17","1/10/14",..: 305 305 836 94 94 922 922 922 922 922 ...
## $ Ship.Date : Factor w/ 1334 levels "1/1/15","1/1/16",..: 220 220 907 129 129 897 897 897 897 897 ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : Factor w/ 793 levels "AA-10315","AA-10375",..: 144 144 240 706 706 89 89 89 89 89 ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : Factor w/ 1862 levels "FUR-BO-10000112",..: 13 56 947 320 1317 186 563 1762 795 438 ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: Factor w/ 1850 levels "\"While you Were Out\" Message Book, One Form per Page",..: 387 833 1440 367 574 570 1137 1099 535 295 ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
To learn the data we have, we can see our data attributes like names, length, colnames, dim etc:
## [1] 15
For small dataset, we can view the full data in its raw format through the View(retail) command. But, it is not recommended to use the View() command, because in real life we don’t always know beforehand the size of data.
We can take a peek of our data by using the head or tail command. And we can pass in an extra argument, n, so the function would return the first/last n number of rows instead of the default 6.
We can also inspect if there is any NA (not available) or missing value from the retail dataset by using anyNA() command or colSums(is.na()) for coloumn checking.
## [1] FALSE
After Reading Data into R, we can see the information from raw dataset:
We can simplify the data we need to do a good analysis. First we can exclude the data that we don’t need, such as Row.ID and Order.ID.
As we have seen in the results of str() command, we have to transform some types of data variables. We could explicitly coerce one class to another:
retail$Order.Date <- as.Date(retail$Order.Date, "%m/%d/%y")
retail$Ship.Date <- as.Date(retail$Ship.Date, "%m/%d/%y")
retail[,c("Customer.ID","Product.ID","Product.Name")] <-
lapply(retail[,c("Customer.ID","Product.ID","Product.Name")], as.character)
#using "lapply()" command to convert/transform variables into one same type of variable
str(retail)## 'data.frame': 9994 obs. of 13 variables:
## $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
We can also remove the Product.ID and Discount variables by using subset() fucntion.
#another method to subset/exclude our data
retail <- subset(retail, select = - c(Product.ID, Discount))
str(retail) #take a review from our new retail dataset structure## 'data.frame': 9994 obs. of 11 variables:
## $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
After we finish tidying up the data we need, we can summary data by using summary() function. This summary() function returns quick summary statistics on each of the variable in our new dataset.
## Order.Date Ship.Date Ship.Mode
## Min. :2014-01-03 Min. :2014-01-07 First Class :1538
## 1st Qu.:2015-05-23 1st Qu.:2015-05-27 Same Day : 543
## Median :2016-06-26 Median :2016-06-29 Second Class :1945
## Mean :2016-04-30 Mean :2016-05-03 Standard Class:5968
## 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :2017-12-30 Max. :2018-01-05
##
## Customer.ID Segment Category
## Length:9994 Consumer :5191 Furniture :2121
## Class :character Corporate :3020 Office Supplies:6026
## Mode :character Home Office:1783 Technology :1847
##
##
##
##
## Sub.Category Product.Name Sales Quantity
## Binders :1523 Length:9994 Min. : 0.444 Min. : 1.00
## Paper :1370 Class :character 1st Qu.: 17.280 1st Qu.: 2.00
## Furnishings: 957 Mode :character Median : 54.490 Median : 3.00
## Phones : 889 Mean : 229.858 Mean : 3.79
## Storage : 846 3rd Qu.: 209.940 3rd Qu.: 5.00
## Art : 796 Max. :22638.480 Max. :14.00
## (Other) :3613
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
##
We can see a lot of information generated by summary() function:
And then we can see which category has the highest sales point and which one has the highest profit by using xtabs() function.
## Category
## Furniture Office Supplies Technology
## 18451.27 122490.80 145454.95
## Category
## Furniture Office Supplies Technology
## 741999.8 719047.0 836154.0
Based on the information from xtabs category tabulation on profits and sales, we can see even though Furniture Office Category sales is greater than Office Supplies, the profit made by Furniture Office is smaller than Office Supplies. We can presume that there is something wrong in the sales of Furniture product category. So, we can analyze the profit made by each Sub.Category as follows:
## Sub.Category
## Category Accessories Appliances Art Binders
## Furniture 0.0000 0.0000 0.0000 0.0000
## Office Supplies 0.0000 18138.0054 6527.7870 30221.7633
## Technology 41936.6357 0.0000 0.0000 0.0000
## Sub.Category
## Category Bookcases Chairs Copiers Envelopes
## Furniture -3472.5560 26590.1663 0.0000 0.0000
## Office Supplies 0.0000 0.0000 0.0000 6964.1767
## Technology 0.0000 0.0000 55617.8249 0.0000
## Sub.Category
## Category Fasteners Furnishings Labels Machines
## Furniture 0.0000 13059.1436 0.0000 0.0000
## Office Supplies 949.5182 0.0000 5546.2540 0.0000
## Technology 0.0000 0.0000 0.0000 3384.7569
## Sub.Category
## Category Paper Phones Storage Supplies
## Furniture 0.0000 0.0000 0.0000 0.0000
## Office Supplies 34053.5693 0.0000 21278.8264 -1189.0995
## Technology 0.0000 44515.7306 0.0000 0.0000
## Sub.Category
## Category Tables
## Furniture -17725.4811
## Office Supplies 0.0000
## Technology 0.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1665.052 -42.631 4.133 -15.230 40.054 1013.127
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1862.31 -122.83 -31.37 -55.57 33.85 629.01
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1049.3406 0.4371 3.9593 -6.2584 9.0858 327.5060
We can see several number unprofitable Sub.Category of all transactions on Furniture category, such as :
And then we can analyze transactions that occur in those categories by subsetting the dataframe.
retail.Fur <- retail[retail$Category == "Furniture",]
retail.Off <- retail[retail$Category == "Office Supplies",]
retail.Tech <- retail[retail$Category == "Technology",]Before we analyze the transactions contained in each new dataset we have, I want to show another method that is more flexible for aggregation. We can try to ensure that transactions that occur in the Furniture and Office Supplies categories made losses.
With this aggregate() function we can also get the same information, Bookcases, Tables and Supplies sub.categories provide some losses on most transactions.
Next we want to observe at the distribution of transactions that exist in each category against our segments market.
##
## Consumer Corporate Home Office
## 0.5247525 0.3045733 0.1706742
##
## Consumer Corporate Home Office
## 0.5189180 0.3020246 0.1790574
##
## Consumer Corporate Home Office
## 0.5247525 0.3045733 0.1706742
By using those functions we can conclude that the transactions distribution from all three Categories are distributed fairly similiar, in which about 52% come from the Consumer market 30% from the Corporate market, and 18% from the Home Office market.
##
## Furniture Office Supplies Technology
## 0.2122273 0.6029618 0.1848109
What is more interesting is that the most profits from 4 years come from the Technology category. From what we can obtain from our summary, the Technology (18.5%) category has the lowest frequency of transactions compared to Furniture (21.2%) and Office Supplies (60.3%).
We can conclude Technology category is very potential for the retail business of PT ABC. With a relatively lowest frequency of transactions compared to others, Technology category product can make the highest profit 145454.95. And we must reconsider to make transaction in Bookcases, Supplies and Tables sub.category which made some losses.