Data Manipulation: Working with strings

Eirikur Jonsson

19 April, 2019


Strings and how they relate to supply chain analytics

In supply chains we often are hit with massive amounts of data. Some of it is structured, unstructured, in paper form, digital form and sometimes spread out over multiple databases. Regardless of what form it is, we are at some point going to have to deal with strings. Strings are a data-type that are used in programming languages to represent text, much like integers and floats represent numeric data. There are a few reasons why its important to know how to work with strings when it comes to supply chain analytics, but my main focus now will be on volume.
##Volume and strings: How it helps

Supply chains have massive amounts of measurements taken at differing stages of the chain. Some times those stages are simple, 3 to 5 stages (incoming/outgoing products with returns incoming/outgoing), sometimes it can go into hundreds (international shipping with multiple transits/warehousing along the way). When dealing with large amounts of variables we need to be able to filter out those variables we want from those we don’t want. Here is where strings come into play


The name of every column of a data frame can be treated as a string in R. This makes extracting and filtering variables very easy. In this demonstration I will be working with 3 packages in R, the tidyverse, stringr and rebus. I will link to a great rpubs I found that helped me a great deal in learning string data manipulation, but for now lets look at the analysis and explore the power of R.

Cargo 2000 dataset: Comparing planed time vs actual time

For more detailed information about the data set please look here. For now lets look at the variables.

head(data[1:7,1:7])
##   nr i1_legid i1_rcs_p i1_rcs_e i1_dep_1_p i1_dep_1_e i1_dep_1_place
## 1  0     5182      199      218        210        215            609
## 2  1     6523      844      584         90        297            700
## 3  2     5878     4380     4119         90        280            456
## 4  3     1275      759      169        240        777            173
## 5  4     8117     1597     1485        150        241            411
## 6  5     9889      181       98        240        259            815

In this dataset we have 98 variables and 3942 observations, which is a lot to deal with. In the documentation we find out that variables ending with _p are planed times for each stage and variables ending with _e are real time completion of each stage. What if I want to work only with real time variables to find choke points, or calculate delay time for each stage. Well, stringr and rebus can help us there. We know we want to separate the Ps from the Es, but we don’t want stages interfering with one another. Again the documentation tells us that i1 is stage 1, i2 stage 2 and i3 is stage 3 in incoming logistics. We could manually go through the dataset and index all the variables like this.

#leg1.actual <- data[,c(4,6,9,12,15,18,21,24)]
#leg2.actual <- data[,c(28,30,33,36,39,42,45,48)]
#leg3.actual <- data[,c(52,54,57,60,63,66,69,72)]

But this is very annoying, hard to do (lots of trial and error) and very likely we might end up with the wrong variables in the data.

Lets do this the easy way, let me first convert the unit measurements of minuets to hours so its easier to interpret the data.

data <- data/60

Next lets find all real time completion times and save them to their own data frame. To do this I use the str_subset function since I am extracting the names. Next I indicate from where I am extracting them from, column-names of data, next I use rebus to indicate the pattern, that being _e at the end of each column-name. Once I have those names stored in a vector I can use them to subset my dataset using that vector like I do here down below.

real.time.names <- str_subset(colnames(data), pattern = "_e"%R%END)

real.time.total <- data[,real.time.names]

head(real.time.total[1:7,1:7])
##    i1_rcs_e i1_dep_1_e i1_rcf_1_e i1_dep_2_e i1_rcf_2_e i1_dep_3_e
## 1  3.633333   3.583333  12.266667   0.000000    0.00000          0
## 2  9.733333   4.950000  23.583333   0.000000    0.00000          0
## 3 68.650000   4.666667   9.116667   0.000000    0.00000          0
## 4  2.816667  12.950000   9.616667  18.483333   13.66667          0
## 5 24.750000   4.016667  10.200000   4.816667    2.25000          0
## 6  1.633333   4.316667   3.716667   0.000000    0.00000          0
##   i1_rcf_3_e
## 1          0
## 2          0
## 3          0
## 4          0
## 5          0
## 6          0

Having done this the dataset is smaller and only includes those variables ending with _e or our real time completion variables. Now we are going to want to separate the stages, we do the same thing only in the reverse, identifying those variable names starting with i1, i2 og i3.

leg1.real.names <- str_subset(colnames(real.time.total), patter = START%R%"i1")
leg2.real.names <- str_subset(colnames(real.time.total), patter = START%R%"i2")
leg3.real.names <- str_subset(colnames(real.time.total), patter = START%R%"i3")

leg1.real.time <- real.time.total[, leg1.real.names]
leg2.real.time <- real.time.total[, leg2.real.names]
leg3.real.time <- real.time.total[, leg3.real.names]

Notice how the START%R%“i1” and “_e“%R%END are the mirror images of each other. This order is important since it is telling R where the characters are positioned in the character string, that being the reason why the order is important as well. Now let me quickly do this for the planed time data.

planed.time.names <- str_subset(colnames(data), pattern = "_p"%R%END)
planed.time.total <- data[,planed.time.names]

leg1.planed.names <- str_subset(colnames(planed.time.total), pattern = START%R%"i1")
leg2.planed.names <- str_subset(colnames(planed.time.total), pattern = START%R%"i2")
leg3.planed.names <- str_subset(colnames(planed.time.total), pattern = START%R%"i3")

leg1.planed.time <- planed.time.total[, leg1.planed.names]
leg2.planed.time <- planed.time.total[, leg2.planed.names]
leg3.planed.time <- planed.time.total[, leg3.planed.names]

What can I use this for?

Well, now I can calculate delay and store it in a data frame. This is simple since all my data frames have the same number of variables with the same number of observations (this means rows). I can simply create this data frame by doing a simple subtraction of real time completion from planed time completion.

leg1.delay <- leg1.planed.time - leg1.real.time
leg2.delay <- leg2.planed.time - leg2.real.time
leg3.delay <- leg3.planed.time - leg3.real.time

So lets add a up all the measurements for each observation (i.e. each row) to see the total delay for each shipment in Leg1

leg1.total.delay <- rowSums(leg1.delay)

summary(leg1.total.delay)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1435.867     8.812    50.125    27.281    67.463   166.433

Since rowSums created a vector, lets convert it into a data frame so I have an easier time plotting this with ggplot.

leg1.total.delay <- data.frame(leg1.total.delay)

names(leg1.total.delay)[1] <- "Leg1.total.delay"

ggplot(data = leg1.total.delay, mapping = aes(x = Leg1.total.delay))+
  geom_density(fill = "#3b5998")

So, have we made up more time than we lost? Can we use this to find bottlenecks? Sure, but that is out of the scope for this short example on how string manipulation can be extremely powerful in the context of supply chain analytics.

And so I include the Rpubs that taught me so so much, its here, its easy to read and the index makes it easy to find specific functions to solve any need.

\(~\) \(~\) \(~\) \(~\)