Introduction

This code through explores a sample of shipping data from an international electronics e-commerce company to determine what factors lead to shipments not arriving at their destinations on time.

The dataset used has 10999 observations of 12 variables. The data contains the following variables:

ID: ID number of the customer.
Warehouse block: The company has a big warehouse that is divided into blocks A, B, C, D, and F.
Mode of shipment: The company ships its products in multiple ways: by Ship, Flight, and Road.
Customer care calls: The number of calls the buyer made to inquire about the shipment.
Customer rating: The rating each customer gave the company. 1 is the lowest (worst), 5 is the highest (best).
Cost of the product: Cost of the product in US dollars.
Prior purchases: The number of prior purchases by the buyer.
Product importance: The company's categorization of the product's importance as low, medium, or high.
Gender: Male or female (of the buyer).
Discount offered: Discount offered on that specific product.
Weight in gms: The weight of the product in grams.
Reached on time: This is the target variable, where 1 indicates that the product has NOT reached the customer on time and 0 indicates that it has. (I know that's confusing, but I didn't make this data.)


Content Overview

Specifically, we’ll explain and demonstrate how to explore the data of a few key variables, and determine if any of the variables in the dataset can help to predict if shipments will be delayed.


Why You Should Care

This topic is valuable because shipping delays are a huge issue for companies. While most companies use third-party shipping companies to move their products, shipping delays tend to reflect badly on the seller of the product and make customers less likely to return.


Learning Objectives

Specifically, you’ll learn how to visually explore your data and run simple regressions.



Loading data

Here, we’ll show how to load the data from a CSV file. Using the read.csv() function, we pass in the path to our file and use the assignment operator (<-) to bring the data into R. Make sure not to use single backslashes in the path (use double backslashes or single forward slashes instead), or your code will throw an error.

data <- read.csv("C:\\Users\\hoffm\\Dropbox\\My PC (DESKTOP-E74M7O6)\\Desktop\\R-class\\Train.csv")
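The path rule above can be sketched with a small stand-in file, so the example runs on any machine. The file name Train_sample.csv, the object name shipments, and the three rows of values are made up for illustration; only the column names come from the real data. With the full Train.csv, dim() should report 10999 rows and 12 columns.

```r
# Hypothetical stand-in for Train.csv, written to a temporary folder
csv_path <- file.path(tempdir(), "Train_sample.csv")
sample_rows <- data.frame(
  ID = 1:3,
  Mode_of_Shipment = c("Flight", "Ship", "Road"),
  Reached.on.Time_Y.N = c(1, 0, 1)
)
write.csv(sample_rows, csv_path, row.names = FALSE)

# file.path() joins path pieces with forward slashes, which R accepts on every OS
shipments <- read.csv(csv_path)

# Confirm the number of rows and columns, and the column names
dim(shipments)
names(shipments)
```

Checking dim() right after loading is a cheap way to catch a wrong file or a bad delimiter before any analysis.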


Visual Inspection

Then we will visually inspect the first 10 rows of the data using head().

head(data, n = 10)
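Beyond head(), a few other base R functions give a quick structural check. The sketch below uses a made-up four-row data frame (the object name shipments and all values are assumptions, not the real data) to show column types, numeric summaries, and a count of missing values per column.

```r
# Toy data frame standing in for the shipping data (values are made up)
shipments <- data.frame(
  Mode_of_Shipment = c("Flight", "Ship", "Ship", "Road"),
  Cost_of_the_Product = c(177, 216, NA, 250),
  Reached.on.Time_Y.N = c(1, 1, 0, 0)
)

str(shipments)      # column types at a glance
summary(shipments)  # min, quartiles, mean, and max for numeric columns

# Count missing values in each column
missing_per_column <- colSums(is.na(shipments))
missing_per_column
```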


Understanding our data

To start, let's get to know this data a little better and understand what is going on with the shipping practices of our business. Let's say we want to know the percentage of packages that don't arrive on time, the average value of packages sent (and the cost distribution), and the breakdown of shipping methods.

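Let’s start by determining the percentage of packages that don’t arrive on time. Because the target variable is coded 1 for late and 0 for on time, its mean is the share of late packages, so we will use the mean() function to do this.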

mean(data$Reached.on.Time_Y.N)
## [1] 0.5966906

Wow! Almost 60%! Seems like this company has a real issue on its hands. Now I'm interested in seeing how the customers rated the company before we move on to our other questions. We can use the mean() function again.

mean(data$Customer_rating)
## [1] 2.990545

Interesting, basically 3 stars: not great, but not awful either. Anyway, let's move on. You can now see how mean() is useful for understanding what is going on (on average) within the business.
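The reason mean() gives a percentage for the target variable is that the column only holds 0s and 1s. A minimal sketch with a made-up vector (late_flag is a hypothetical name, not a column in the data) shows that the mean matches the share of 1s reported by prop.table().

```r
# Made-up flags: 1 = late, 0 = on time
late_flag <- c(1, 1, 0, 1, 0)

# The mean of a 0/1 vector is the proportion of 1s
late_share <- mean(late_flag)

# The same proportion, computed from a frequency table
late_table <- prop.table(table(late_flag))

late_share
late_table
```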


Average Cost and Cost Distribution

We already know how to use the mean() function, so let's see the average cost of products sold.

mean(data$Cost_of_the_Product)
## [1] 210.1968

So about $210.20. But this doesn't tell us as much as a distribution will.

To do this we need to create a histogram. We can do this using base R’s plotting functionality.

hist(data$Cost_of_the_Product)

We can now see that the sales of this company are skewed towards more expensive products. But this graph could look much better. We can use the built-in arguments of hist() to add color, improve the labels, and increase the number of bins (or breaks) to get a better look at the distribution.

hist(data$Cost_of_the_Product,
     breaks = 20,
     main = "Cost of Products Sold",
     xlab = "Cost",
     ylab = "number",
     xlim = c(50, 350),
     ylim = c(0, 1000),
     col = "darkgreen")
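To read the numbers behind a histogram, hist() can be called with plot = FALSE, which returns the bin edges and counts instead of drawing. The sketch below uses simulated costs (the costs vector and its range are made up, not the real data). Note that a single number passed to breaks is treated as a suggestion, so the actual number of bins can differ from 20.

```r
set.seed(42)

# Simulated product costs (made-up values for illustration)
costs <- round(runif(500, min = 96, max = 310))

# Compute the histogram without drawing it
h <- hist(costs, breaks = 20, plot = FALSE)

h$breaks  # bin edges actually chosen by hist()
h$counts  # number of observations in each bin

# Comparing mean and median is a quick numeric check for skew
c(mean = mean(costs), median = median(costs))
```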


Shipping Methods

Shipping method is non-numeric, so we can’t use the hist() function to compare shipping methods. Instead, we can count the shipments per method with the table() function and plot those counts with barplot().

head(data$Mode_of_Shipment, n = 20)
##  [1] "Flight" "Flight" "Flight" "Flight" "Flight" "Flight" "Flight" "Flight"
##  [9] "Flight" "Flight" "Flight" "Flight" "Flight" "Flight" "Flight" "Flight"
## [17] "Flight" "Ship"   "Ship"   "Ship"
class(data$Mode_of_Shipment)
## [1] "character"
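library(dplyr)  # provides %>%, group_by(), summarise(), and n()
# Count shipments per shipping method with dplyr
dat <- data %>%
  group_by(Mode_of_Shipment) %>%
  summarise(n = n())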
 

mytable <- table(data$Mode_of_Shipment)

barplot(mytable,
        main = "Shipping Methods",
        xlab = "Method",
        ylab = "number",
        col = "red")

Great! Now we have visualized the shipping methods and can see that this company sends most of its products by ship (as in boats).
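The counts behind the bar plot can also be turned into shares, and the late rate can be compared across shipping methods. This is a sketch on a made-up six-row data frame (the object name shipments and all values are assumptions), using only base R.

```r
# Toy data: shipping method and late flag (1 = late), values are made up
shipments <- data.frame(
  Mode_of_Shipment = c("Ship", "Ship", "Ship", "Flight", "Road", "Ship"),
  Reached.on.Time_Y.N = c(1, 0, 1, 1, 0, 1)
)

# Counts per method, largest first
mode_counts <- sort(table(shipments$Mode_of_Shipment), decreasing = TRUE)

# Share of shipments per method
mode_share <- round(prop.table(mode_counts), 2)

# Share of late shipments within each method
late_by_mode <- tapply(shipments$Reached.on.Time_Y.N,
                       shipments$Mode_of_Shipment,
                       mean)

mode_counts
mode_share
late_by_mode
```

Looking at the late rate per method before fitting a model gives a first impression of whether shipping method matters at all.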


Prediction

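Now let's take a look at what might predict whether a package arrives late to the customer. We will use a linear regression with several predictors to do this. Because the outcome is coded 0 or 1, each coefficient can be read as a change in the probability of a late delivery.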

lm(formula = Reached.on.Time_Y.N ~ Warehouse_block+Mode_of_Shipment+Cost_of_the_Product+Product_importance+Discount_offered+Weight_in_gms, data=data)
## 
## Call:
## lm(formula = Reached.on.Time_Y.N ~ Warehouse_block + Mode_of_Shipment + 
##     Cost_of_the_Product + Product_importance + Discount_offered + 
##     Weight_in_gms, data = data)
## 
## Coefficients:
##              (Intercept)          Warehouse_blockB          Warehouse_blockC  
##                7.780e-01                 1.928e-02                 1.080e-02  
##         Warehouse_blockD          Warehouse_blockF      Mode_of_ShipmentRoad  
##                1.400e-02                 9.266e-03                -1.112e-02  
##     Mode_of_ShipmentShip       Cost_of_the_Product     Product_importancelow  
##               -6.298e-03                -4.758e-04                -5.996e-02  
## Product_importancemedium          Discount_offered             Weight_in_gms  
##               -5.849e-02                 1.007e-02                -4.571e-05

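Alright. The estimated coefficients for shipping method are small (about one percentage point or less relative to Flight, the baseline), so it seems that shipping method doesn’t directly influence the timeliness of shipping for this company very much. The warehouse block coefficients are also small (roughly one to two percentage points relative to block A), while each additional unit of discount offered is associated with about a one percentage point higher chance of a late delivery. Keep in mind that this output shows only the coefficient estimates. Standard errors and p-values, which summary() reports for a fitted model, are needed to judge whether any of these effects can be distinguished from zero.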
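To go one step further than the printed coefficients, the fitted model can be stored and passed to summary(). The sketch below runs on simulated data (the toy data frame, the late column, and the simulated effect sizes are all made up, not results from the real dataset). It also fits a logistic regression with glm() as an alternative that is commonly used for 0/1 outcomes; that is a different technique from the linear regression used above, shown only for comparison.

```r
set.seed(1)
n <- 300

# Simulated stand-in data (values are made up for illustration)
toy <- data.frame(
  Discount_offered = runif(n, 1, 65),
  Mode_of_Shipment = sample(c("Flight", "Road", "Ship"), n, replace = TRUE)
)

# Simulated outcome: larger discounts make a late arrival more likely
toy$late <- rbinom(n, 1, plogis(-1 + 0.05 * toy$Discount_offered))

# Linear regression on a 0/1 outcome, stored so it can be summarised
lpm <- lm(late ~ Discount_offered + Mode_of_Shipment, data = toy)
lpm_table <- summary(lpm)$coefficients  # estimates, std. errors, t values, p-values

# Logistic regression alternative: predicted probabilities stay between 0 and 1
logit <- glm(late ~ Discount_offered + Mode_of_Shipment,
             data = toy, family = binomial)
logit_probs <- predict(logit, type = "response")

lpm_table
range(logit_probs)
```

The same pattern should carry over to the real data by swapping in the original formula and data frame.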

