This data contains sales records including discounts, profits of a retail company that sells Office Supplies, Furniture and Technology goods. The buyers come from various segments such as Corporate, Home Office to individual consumers. This company can also provide the best option for your shipment that suits your needs.
We will analyze the retail.csv data contained in the
data_input folder. Use the read.csv() function
to read the CSV file to R.
retail <- read.csv(file="data_input/retail.csv")
retailafter we have successfully imported our data, we will do a data inspection to find out contents our data, actually we can use the view() function to view the contents of the data but it will take time to see the whole data so we use a function that sees the head() and tail() data’s only.
head(retail)tail(retail)dim(retail)#> [1] 9994 15
The functions anteNA() and is.na() are used
to find out whether there is a missing value in the data, functions that
sound simple but are very crucial if they are not carried out because
they will affect our work process in analysis, for example during
calculations.
anyNA(retail)#> [1] FALSE
colSums(is.na(retail))#> Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
#> 0 0 0 0 0 0
#> Segment Product.ID Category Sub.Category Product.Name Sales
#> 0 0 0 0 0 0
#> Quantity Discount Profit
#> 0 0 0
From the results of our inspections, we get some information that can support our analysis, as follows:
Data cleansing is done to make the data we have appropriate the data we need for analysis purposes.
str(retail)#> 'data.frame': 9994 obs. of 15 variables:
#> $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#> $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#> $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
From this result, we find some of data type not in the corect type. we need to convert it into corect type (data corection)
retail$Row.ID <- as.character(retail$Row.ID)
retail$Segment <- as.factor(retail$Segment)
retail$Ship.Mode <- as.factor(retail$Ship.Mode)
retail$Category <- as.factor(retail$Category)
retail$Sub.Category <- as.factor(retail$Sub.Category)
str(retail)#> 'data.frame': 9994 obs. of 15 variables:
#> $ Row.ID : chr "1" "2" "3" "4" ...
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#> $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#> $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#> $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
there are still inaccurate data types, namely Order.Date
and Ship.Date, we will change it by using the
lubridate library.
library(lubridate)
retail$Order.Date <- mdy(retail$Order.Date)
retail$Ship.Date <- mdy(retail$Ship.Date)str(retail)#> 'data.frame': 9994 obs. of 15 variables:
#> $ Row.ID : chr "1" "2" "3" "4" ...
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
#> $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
#> $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#> $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Now, Each of column already changed into desired data type dataset is ready to be processed and analyzed.
summary(retail)#> Row.ID Order.ID Order.Date
#> Length:9994 Length:9994 Min. :2014-01-03
#> Class :character Class :character 1st Qu.:2015-05-23
#> Mode :character Mode :character Median :2016-06-26
#> Mean :2016-04-30
#> 3rd Qu.:2017-05-14
#> Max. :2017-12-30
#>
#> Ship.Date Ship.Mode Customer.ID
#> Min. :2014-01-07 First Class :1538 Length:9994
#> 1st Qu.:2015-05-27 Same Day : 543 Class :character
#> Median :2016-06-29 Second Class :1945 Mode :character
#> Mean :2016-05-03 Standard Class:5968
#> 3rd Qu.:2017-05-18
#> Max. :2018-01-05
#>
#> Segment Product.ID Category
#> Consumer :5191 Length:9994 Furniture :2121
#> Corporate :3020 Class :character Office Supplies:6026
#> Home Office:1783 Mode :character Technology :1847
#>
#>
#>
#>
#> Sub.Category Product.Name Sales Quantity
#> Binders :1523 Length:9994 Min. : 0.444 Min. : 1.00
#> Paper :1370 Class :character 1st Qu.: 17.280 1st Qu.: 2.00
#> Furnishings: 957 Mode :character Median : 54.490 Median : 3.00
#> Phones : 889 Mean : 229.858 Mean : 3.79
#> Storage : 846 3rd Qu.: 209.940 3rd Qu.: 5.00
#> Art : 796 Max. :22638.480 Max. :14.00
#> (Other) :3613
#> Discount Profit
#> Min. :0.0000 Min. :-6599.978
#> 1st Qu.:0.0000 1st Qu.: 1.729
#> Median :0.2000 Median : 8.666
#> Mean :0.1562 Mean : 28.657
#> 3rd Qu.:0.2000 3rd Qu.: 29.364
#> Max. :0.8000 Max. : 8399.976
#>
Summary :
Check the Outlier within profit
boxplot(retail$Profit)From result above, we find posibilities for the outliers, but we need to see or calculate the sd value so that we can tolerate our data or not.
aggregate(Profit~Segment,retail,sd)From the result our Sd all segment is around 200 my oppinion its still be tolerated, so the process may continue.
our company wants to evaluate sales results by segment to plan in the following period in order to increase our sales
sales_agg <- aggregate(formula=cbind(Sales,Profit)~Segment, data=retail, FUN=sum)
sales_agg[order(sales_agg$Sales, decreasing = T),]The aggregation results show that our sales and profits for 4 years are quite good in the Consumer segment compared to other segments during these 4 years.
freq_retail <- table(retail$Category, retail$Segment)
freq_retail#>
#> Consumer Corporate Home Office
#> Furniture 1113 646 362
#> Office Supplies 3127 1820 1079
#> Technology 951 554 342
Category Office Supplies were the most purchased in each segment, indicating that Office Supplies were the most needed or in demand during the past 4 years.
office_supplies <- retail[retail$Category=="Office Supplies",]
qty_officesupplies <- aggregate(formula=Quantity~Sub.Category, data=office_supplies, FUN=sum)
qty_officesupplies[order(qty_officesupplies$Quantity, decreasing = T),]from our calculations on the category office supplies subcategory the most interested are Binders and the least interested or purchased are Supplies.
Technology <- retail[retail$Category=="Technology",]
qty_techonlogy <- aggregate(formula=Quantity~Sub.Category, data=Technology, FUN=sum)
qty_techonlogy[order(qty_techonlogy$Quantity, decreasing = F),]These results show that the lowest sales in technology are Copiers, in my opinion, this is because photocopy machines in this era have slightly reduced their use, where currently we use digital data more often that can be directly duplicated digitally, and there are several the movement to reduce paper use made hardcopy duplication very rare.
retail$year_order <- format(retail$Order.Date,"%Y")
retail$yearmonth_order <- format(retail$Order.Date,"%Y-%m")
head(retail)year_tot <- aggregate(formula=Quantity~year_order, data=retail, FUN=sum)
year_tot[order(year_tot$Quantity, decreasing = T),]In 2017 there was a rapid increase compared to the previous year, from this maybe we can replicate what we did in 2017 the next period.
year_avg <- aggregate(formula=Profit~year_order, data=retail, FUN=mean)
year_avg[order(year_avg$Profit, decreasing = T),]despite in 2017 there was the highest order quantity, but if seen from the average profit in 2017 it was only in the 3rd position, my assumption this is because there is an outlier profit in 2017.
library(ggplot2)
month_avg <- aggregate(formula=Profit~yearmonth_order, data=retail[retail$year_order == 2017,], FUN=sum)
ggplot(month_avg, aes(x = yearmonth_order, y = Profit)) +
geom_line(group = 1) +
geom_point()Referring to statement number 5, it was proven that there was a high spike in certain months that occurred in 2017-03, this happened because there were certain events in that month that caused high demand.
shipping <- aggregate(formula=Order.ID~Ship.Mode, data=retail, FUN=length)
shippingStandard class is the customer’s favorite shipping class, in my opinion it’s because price given is cheaper than other shipping modes.
retail$Duration <- difftime(time1=retail$Ship.Date, time2 = retail$Order.Date, units = "days")
shipping_duration <- aggregate(formula=Duration~Ship.Mode, data=retail, FUN=median)
shipping_durationWe will use this median value as a standard reference for shipping each ship mode
nrow(retail[retail$Ship.Mode =="First Class" & retail$Duration > 2,c("Ship.Mode","Duration")])#> [1] 624
There are still 624 orders that are still below the time we specify.
nrow(retail[retail$Ship.Mode =="Same Day" & retail$Duration > 0,c("Ship.Mode","Duration")])#> [1] 24
There are still 24 orders that are still below the time we specify.
nrow(retail[retail$Ship.Mode =="Second Class" & retail$Duration > 3,c("Ship.Mode","Duration")])#> [1] 800
There are still 800 orders that are still below the time we specify.
nrow(retail[retail$Ship.Mode =="Standard Class" & retail$Duration > 5,c("Ship.Mode","Duration")])#> [1] 1824
There are still 1824 orders that are still below the time we specify.
sum(retail$Quantity)#> [1] 37873
Answer : 37873 items
sum(retail$Profit)#> [1] 286397
Answer : 286397 items
discount <- aggregate(formula=cbind(Quantity,Profit)~year_order, data=retail[retail$Discount>0,], FUN=sum)
discountFrom this result, the discount that we give to consumers is less effective because if we look at the total profit, we see a loss every year.
no_discount <- aggregate(formula=cbind(Quantity,Profit)~year_order, data=retail[retail$Discount==0,], FUN=sum)
no_discountrefers to statement no.14 is proven that the discount that we make is less effective if we look at the total quantity sold that is not much different, even though we are profitable in the absence of a discount, we actually generate a large profit.
In 4 years, since bigining 2014-2017 Our Retail has sold a total of 37873 items with a total profit of 286397. The company sells 3 main categories: Furniture, Office Supplies and Technology.
of all categories Office Supplies provide the biggest profit from each segment, the most desirable sub profit is Binders. Unfortunately, in the Technology category, sales are the smallest in each segment, in my opinion, this could be due to the nature of our consumers who buy technology for long-term use, and because in the Copiers subcategory their use has decreased slightly due to the shift in digital document duplication.
In 2017 there was a rapid increase sales compared to the previous year, despite in 2017 there was the highest order quantity, but if seen from the average profit in 2017 it was only in the 3rd position, my assumption this is because there is an outlier profit in 2017 or certain events in that month that caused high demand.
Standard class is the customer’s favorite shipping class, in my opinion it’s because price given is cheaper than other shipping modes. Unfortunately, there are still many delays in each shipping method that we provide if we refer to the standards that we set on above.
If we look at the results of our calculations, discounts are not very effective because the total number of sales with or without discounts does not have a significant difference, it actually gives a loss to our company.