In supply chains we often are hit with massive amounts of data. Some of it is structured, unstructured, in paper form, digital form and sometimes spread out over multiple databases. Regardless of what form it is, we are at some point going to have to deal with strings. Strings are a data-type that are used in programming languages to represent text, much like integers and floats represent numeric data. There are a few reasons why its important to know how to work with strings when it comes to supply chain analytics, but my main focus now will be on volume.
##Volume and strings: How it helps
Supply chains have massive amounts of measurements taken at differing stages of the chain. Some times those stages are simple, 3 to 5 stages (incoming/outgoing products with returns incoming/outgoing), sometimes it can go into hundreds (international shipping with multiple transits/warehousing along the way). When dealing with large amounts of variables we need to be able to filter out those variables we want from those we don’t want. Here is where strings come into play
The name of every column of a data frame can be treated as a string in R. This makes extracting and filtering variables very easy. In this demonstration I will be working with 3 packages in R, the tidyverse, stringr and rebus. I will link to a great rpubs I found that helped me a great deal in learning string data manipulation, but for now lets look at the analysis and explore the power of R.
For more detailed information about the data set please look here. For now lets look at the variables.
head(data[1:7,1:7])## nr i1_legid i1_rcs_p i1_rcs_e i1_dep_1_p i1_dep_1_e i1_dep_1_place
## 1 0 5182 199 218 210 215 609
## 2 1 6523 844 584 90 297 700
## 3 2 5878 4380 4119 90 280 456
## 4 3 1275 759 169 240 777 173
## 5 4 8117 1597 1485 150 241 411
## 6 5 9889 181 98 240 259 815
In this dataset we have 98 variables and 3942 observations, which is a lot to deal with. In the documentation we find out that variables ending with _p are planed times for each stage and variables ending with _e are real time completion of each stage. What if I want to work only with real time variables to find choke points, or calculate delay time for each stage. Well, stringr and rebus can help us there. We know we want to separate the Ps from the Es, but we don’t want stages interfering with one another. Again the documentation tells us that i1 is stage 1, i2 stage 2 and i3 is stage 3 in incoming logistics. We could manually go through the dataset and index all the variables like this.
#leg1.actual <- data[,c(4,6,9,12,15,18,21,24)]
#leg2.actual <- data[,c(28,30,33,36,39,42,45,48)]
#leg3.actual <- data[,c(52,54,57,60,63,66,69,72)]But this is very annoying, hard to do (lots of trial and error) and very likely we might end up with the wrong variables in the data.
Lets do this the easy way, let me first convert the unit measurements of minuets to hours so its easier to interpret the data.
data <- data/60Next lets find all real time completion times and save them to their own data frame. To do this I use the str_subset function since I am extracting the names. Next I indicate from where I am extracting them from, column-names of data, next I use rebus to indicate the pattern, that being _e at the end of each column-name. Once I have those names stored in a vector I can use them to subset my dataset using that vector like I do here down below.
real.time.names <- str_subset(colnames(data), pattern = "_e"%R%END)
real.time.total <- data[,real.time.names]
head(real.time.total[1:7,1:7])## i1_rcs_e i1_dep_1_e i1_rcf_1_e i1_dep_2_e i1_rcf_2_e i1_dep_3_e
## 1 3.633333 3.583333 12.266667 0.000000 0.00000 0
## 2 9.733333 4.950000 23.583333 0.000000 0.00000 0
## 3 68.650000 4.666667 9.116667 0.000000 0.00000 0
## 4 2.816667 12.950000 9.616667 18.483333 13.66667 0
## 5 24.750000 4.016667 10.200000 4.816667 2.25000 0
## 6 1.633333 4.316667 3.716667 0.000000 0.00000 0
## i1_rcf_3_e
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Having done this the dataset is smaller and only includes those variables ending with _e or our real time completion variables. Now we are going to want to separate the stages, we do the same thing only in the reverse, identifying those variable names starting with i1, i2 og i3.
leg1.real.names <- str_subset(colnames(real.time.total), patter = START%R%"i1")
leg2.real.names <- str_subset(colnames(real.time.total), patter = START%R%"i2")
leg3.real.names <- str_subset(colnames(real.time.total), patter = START%R%"i3")
leg1.real.time <- real.time.total[, leg1.real.names]
leg2.real.time <- real.time.total[, leg2.real.names]
leg3.real.time <- real.time.total[, leg3.real.names]Notice how the START%R%“i1” and “_e“%R%END are the mirror images of each other. This order is important since it is telling R where the characters are positioned in the character string, that being the reason why the order is important as well. Now let me quickly do this for the planed time data.
planed.time.names <- str_subset(colnames(data), pattern = "_p"%R%END)
planed.time.total <- data[,planed.time.names]
leg1.planed.names <- str_subset(colnames(planed.time.total), pattern = START%R%"i1")
leg2.planed.names <- str_subset(colnames(planed.time.total), pattern = START%R%"i2")
leg3.planed.names <- str_subset(colnames(planed.time.total), pattern = START%R%"i3")
leg1.planed.time <- planed.time.total[, leg1.planed.names]
leg2.planed.time <- planed.time.total[, leg2.planed.names]
leg3.planed.time <- planed.time.total[, leg3.planed.names]Well, now I can calculate delay and store it in a data frame. This is simple since all my data frames have the same number of variables with the same number of observations (this means rows). I can simply create this data frame by doing a simple subtraction of real time completion from planed time completion.
leg1.delay <- leg1.planed.time - leg1.real.time
leg2.delay <- leg2.planed.time - leg2.real.time
leg3.delay <- leg3.planed.time - leg3.real.timeSo lets add a up all the measurements for each observation (i.e. each row) to see the total delay for each shipment in Leg1
leg1.total.delay <- rowSums(leg1.delay)
summary(leg1.total.delay)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1435.867 8.812 50.125 27.281 67.463 166.433
Since rowSums created a vector, lets convert it into a data frame so I have an easier time plotting this with ggplot.
leg1.total.delay <- data.frame(leg1.total.delay)
names(leg1.total.delay)[1] <- "Leg1.total.delay"
ggplot(data = leg1.total.delay, mapping = aes(x = Leg1.total.delay))+
geom_density(fill = "#3b5998")So, have we made up more time than we lost? Can we use this to find bottlenecks? Sure, but that is out of the scope for this short example on how string manipulation can be extremely powerful in the context of supply chain analytics.
And so I include the Rpubs that taught me so so much, its here, its easy to read and the index makes it easy to find specific functions to solve any need.
\(~\) \(~\) \(~\) \(~\)