This code through explores a sample of shipping data from an international electronics e-commerce company to determine what factors lead to shipments not arriving at their destinations on time.
The dataset used has 10999 observations of 12 variables. The data contains the following variables:
ID: ID number of the customer.
Warehouse block: The company has a big warehouse which is divided into blocks such as A, B, C, D, E (the regression output later in this code through labels the last block F).
Mode of shipment: The company ships its products in multiple ways, such as by Ship, Flight and Road.
Customer care calls: The number of calls made by the buyer to inquire about the shipment.
Customer rating: The rating each customer gave the company. 1 is the lowest (worst), 5 is the highest (best).
Cost of the product: Cost of the product in US dollars.
Prior purchases: The number of prior purchases by the buyer.
Product importance: The company has categorized each product as low, medium, or high importance.
Gender: Male or female (of the buyer).
Discount offered: Discount offered on that specific product.
Weight in gms: The weight of the product in grams.
Reached on time: This is the target variable, where 1 indicates that the product has NOT reached the customer on time and 0 indicates that it has. (I know that's confusing, but I didn't make this data.)
Specifically, we’ll explain and demonstrate how to explore the data of a few key variables, and determine if any of the variables in the dataset can help to predict if shipments will be delayed.
This topic is valuable because shipping delays are a huge issue for companies. Even though most companies use third-party shipping companies to move their product, delays tend to reflect badly on the seller of the product and reduce the number of returning customers.
Specifically, you’ll learn how to visually explore your data and run simple regressions.
Here, we’ll show how to load the data from a CSV file. Using the read.csv() function, we pass in the path to our file and use the assignment operator to bring the data into R. Make sure not to use single backslashes in the path (use double backslashes or single forward slashes instead), or your code will throw an error.
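As a minimal sketch of the path rule above, the following writes a tiny made-up CSV to a temporary folder and reads it back. The file name toy_shipments.csv and the data frame contents are hypothetical stand-ins for illustration, not the real Train.csv.

```r
# Hypothetical toy data standing in for the real shipping file
toy <- data.frame(ID = 1:3,
                  Mode_of_Shipment = c("Flight", "Ship", "Road"))

# file.path() joins the folder and file name with a forward slash,
# which R accepts on Windows, macOS, and Linux
toy_path <- file.path(tempdir(), "toy_shipments.csv")
write.csv(toy, toy_path, row.names = FALSE)

# Read it back the same way the real file would be read
toy_read <- read.csv(toy_path)
dim(toy_read)  # number of rows, then number of columns
```

Building paths with file.path() avoids typing backslashes at all, which sidesteps the escaping problem.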
data <- read.csv("C:\\Users\\hoffm\\Dropbox\\My PC (DESKTOP-E74M7O6)\\Desktop\\R-class\\Train.csv")
Then we will visually inspect the first 10 rows of the data using head().
head(data, n = 10)
To start, let's get to know this data a little better and understand what is going on with the shipping practices of our business. Let's say we want to know the percentage of packages that don't arrive on time, the average value of packages sent (and the cost distribution), and the breakdown of shipping methods.
Let’s start by determining the percentage of packages that don’t arrive on time. We will use the mean() function to do this.
mean(data$Reached.on.Time_Y.N)
## [1] 0.5966906
Wow! Almost 60%! Seems like this company has a real issue on its hands. Now I'm interested in seeing how the customers rated the company before we move on to our other questions. We can use the mean() function again.
mean(data$Customer_rating)
## [1] 2.990545
Interesting: basically 3 stars, not great but not awful either. Anyway, let's move on. You can now see how mean() is useful for summarizing what is going on, on average, within the business.
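The reason mean() returns a percentage for the delay column is that the column holds only 0s and 1s, so its average is the share of 1s. A small sketch with a made-up vector (the values are illustrative, not taken from the dataset):

```r
# Hypothetical 0/1 indicator: 1 = late, 0 = on time
late <- c(1, 0, 1, 1, 0)

# The mean of a 0/1 vector equals the proportion of 1s
share_late <- mean(late)
share_late  # 3 of the 5 made-up shipments are late

# The same shares via a frequency table
prop.table(table(late))
```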
We already know how to use the mean() function, so let's see the average cost of products sold.
mean(data$Cost_of_the_Product)
## [1] 210.1968
So about $210.20. But this doesn't tell us as much as a distribution will.
To do this we need to create a histogram. We can do this using base R’s plotting functionality.
hist(data$Cost_of_the_Product)
We can now see that the sales of this company are skewed towards more
expensive products. But this graph could look way better, and we can do
that using the built-in arguments of the hist() function to add
color, use better labels, and increase the number of bins (or breaks)
to get a better look at the distribution.
hist(data$Cost_of_the_Product,
     breaks = 20,
     main = "Cost of Products Sold",
     xlab = "Cost",
     ylab = "number",
     xlim = c(50, 350),
     ylim = c(0, 1000),
     col = "darkgreen")
However, because shipping method is non-numeric, we can’t use the hist() function to compare shipping methods. Instead, we can count the categories with the table() function and plot the counts with barplot().
head(data$Mode_of_Shipment, n = 20)
##  [1] "Flight" "Flight" "Flight" "Flight" "Flight" "Flight" "Flight" "Flight"
##  [9] "Flight" "Flight" "Flight" "Flight" "Flight" "Flight" "Flight" "Flight"
## [17] "Flight" "Ship"   "Ship"   "Ship"
class(data$Mode_of_Shipment)
## [1] "character"
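library(dplyr)  # provides %>%, group_by(), summarise(), and n()
# Count shipments per shipping mode (same counts as the table() call below)
dat <- data %>%
  group_by(Mode_of_Shipment) %>%
  summarise(n = n())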
mytable <- table(data$Mode_of_Shipment)
barplot(mytable,
        main = "Shipping Methods",
        xlab = "Method",
        ylab = "number",
        col = "red")
Great! Now we have counted and visualized the shipping methods, and we can see that this company sends most of its products by ship (as in boats).
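To go one step beyond raw counts, the table can be converted to shares, and the delay indicator can be averaged within each shipping mode. The sketch below uses a tiny made-up data frame whose column names mirror the dataset; the values are hypothetical and only illustrate the calls.

```r
# Hypothetical toy data; column names mirror the dataset, values are made up
toy <- data.frame(
  Mode_of_Shipment    = c("Ship", "Ship", "Flight", "Road", "Ship", "Flight"),
  Reached.on.Time_Y.N = c(1, 0, 1, 1, 1, 0)
)

# Counts per shipping mode, then converted to shares of all shipments
mode_counts <- table(toy$Mode_of_Shipment)
mode_shares <- prop.table(mode_counts)
round(mode_shares, 2)

# Share of late shipments (1 = late) within each shipping mode
late_by_mode <- tapply(toy$Reached.on.Time_Y.N, toy$Mode_of_Shipment, mean)
late_by_mode
```

On the real data, the same two calls would use data in place of toy.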
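Now let's take a look at what might predict whether a package arrives late to the customer. We will use a linear regression to do this. Because the outcome is coded 0/1, each coefficient can be read as a change in the probability of a late delivery.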
lm(formula = Reached.on.Time_Y.N ~ Warehouse_block + Mode_of_Shipment + Cost_of_the_Product + Product_importance + Discount_offered + Weight_in_gms, data = data)
##
## Call:
## lm(formula = Reached.on.Time_Y.N ~ Warehouse_block + Mode_of_Shipment +
##     Cost_of_the_Product + Product_importance + Discount_offered +
##     Weight_in_gms, data = data)
##
## Coefficients:
##              (Intercept)          Warehouse_blockB          Warehouse_blockC
##                7.780e-01                 1.928e-02                 1.080e-02
##         Warehouse_blockD          Warehouse_blockF      Mode_of_ShipmentRoad
##                1.400e-02                 9.266e-03                -1.112e-02
##     Mode_of_ShipmentShip       Cost_of_the_Product     Product_importancelow
##               -6.298e-03                -4.758e-04                -5.996e-02
## Product_importancemedium          Discount_offered             Weight_in_gms
##               -5.849e-02                 1.007e-02                -4.571e-05
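Alright. The shipping-mode coefficients are small (about -0.011 for Road and -0.006 for Ship, relative to Flight), so shipping method does not appear to influence timeliness much for this company. The warehouse-block coefficients (roughly 0.009 to 0.019, relative to block A) and the discount coefficient (about 0.010 per unit of discount offered) are positive, which suggests a slightly higher chance of delay, since 1 means the package did not arrive on time. Note that printing the lm() object shows only the coefficients; judging whether these effects are statistically meaningful requires the standard errors reported by summary().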
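To judge how reliable coefficients like these are, one option is to call summary() on the fitted model, which reports standard errors and p-values. The sketch below does this on simulated stand-in data, and also fits a logistic regression with glm(), a common alternative for a 0/1 outcome that this code through does not otherwise use. The data, the sample size, and the assumed effect of the discount are all hypothetical.

```r
# Simulated stand-in data (hypothetical); 1 = late, 0 = on time
set.seed(42)
n <- 500
toy <- data.frame(
  Discount_offered = runif(n, 0, 60),
  Mode_of_Shipment = sample(c("Flight", "Road", "Ship"), n, replace = TRUE)
)

# Assumption for the simulation only: larger discounts raise the chance of delay
p_late <- plogis(-1 + 0.05 * toy$Discount_offered)
toy$Reached.on.Time_Y.N <- rbinom(n, 1, p_late)

# Linear regression as in the text; summary() adds standard errors and p-values
lpm <- lm(Reached.on.Time_Y.N ~ Discount_offered + Mode_of_Shipment, data = toy)
summary(lpm)$coefficients

# Logistic regression: coefficients are on the log-odds scale
logit <- glm(Reached.on.Time_Y.N ~ Discount_offered + Mode_of_Shipment,
             data = toy, family = binomial)
summary(logit)$coefficients
```

The linear model's coefficients read directly as changes in probability, while the logistic model keeps predicted probabilities between 0 and 1; which one is preferable depends on the goal of the analysis.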
This code through references and cites the following sources: