Merging & Manipulating Data: An Introduction & Refresher

1. The Data for the Fashion Logistics Project

The data for the fashion logistics case are stored in STATA.

Luckily, R's foreign package has a couple of functions to import data that is stored in formats like STATA, SPSS and other software packages.

For reading data stored in STATA format, we can make use of the read.dta() command.

The foreign package is part of the system library that comes with installing R and RStudio. That means that you do not have to install it using install.packages("foreign") - although it does no harm if you do so - but you do have to invoke the functions in foreign by running library(foreign).

1.1 The Data Set(s)

Within the project, data has been collected on shipments.

Shipments have both a sender and a receiver, obviously. Since - especially when in the clothing segment - there are a lot of return shipments, organizations can be senders of some shipments and receivers of other shipments.

In the data set, postal codes (zip codes) have been recorded. A separate data set contains the coordinates of the (4-digit) Dutch postal codes. Based on these coordinates, it is possible to measure the "as the crow flies" distance between postal codes. These distances follows the Pythagorean Theorem, which you may remember from your days in high school.

The theorem states that the square of the hypotenuse of a right triangle is equal to the sum of the squares on the other two sides. Pretty cryptic. The video below clarifies things.

We will use the longitude and latitude later on, after some easier-to-digest examples.

Other elements in the data set indicate the way the goods are packaged or transported. Examples are boxes and crates, and the quantities. A shipment may consists of, say, 3 crates. However, in order to measure or estimate the volume of the shipment, in square meters, we have to convert the quantities and packaging types, using conversion ratios. For example, 1 box has an average volume of 0.1 m³.

In a nutshell, using (hopefully correctly) the way we would depict the data structure in archi, we have something like this:

For statistical and analytical purposes, we prefer to have the data in rectangular form, with rows and columns.

The rows represent, in our case, shipments, and the columns information about these shipments. The data in rectangular form, comes under various names. Data matrix would be an obvious name. When using R, we normally speak of a data frame. This is the preferred term to us, since matrices in R have a special meaning. A data frame is, in essence, a special form of a matrix.

The information in the columns, are commonly referred to as variables.

In order to produce our data frame, from the various relational data sets in the figure above, we need to merge files.

So, let's start with an introduction to merging files in R, to get a better understanding of what we mean by merging!

1.2 Introduction to Merging

There are various situations we can think of when merging data.

The three most common situations are depicted below.

In the situation on the left, we have two files which both provide unique information on cases, but not (necessarily) on the same cases.

As an example, we do have the postal codes of senders and receivers participating in the project (data frame A), but most likely there are way more postal codes in the country that are not linked to any of these senders and receivers. We have postal codes in the city of Leeuwarden, but none of the senders and receivers are located in Leeuwarden. If, for some odd reason, we would still like to keep these postal codes in the merged data set, then we have to make a full-join.

The situation in the middle, a left-join, would be more logical for our purposes. Here, we leave out all the postal codes (and the longitude and latitude information related to these postal codes) which do not occur in our list of senders and receivers.

The situation on the right is not what we would normally refer to as merging. Merging adds more information (columns, or variables) on cases in which we are interested. In the third situation, we mainly add more cases to our starting set of cases.

Our data refers to shipments in one specific year. Adding the same information on shipments taking place in other years, would be an example.

Various software programs deal with these situations in different manners: some require that the structure of the "appended" data sets is identical (all columns in A and B, that is, are the same), while others can handle situations in which A and B at least share some columns. The new file, in th latter case, will include information that is unique for A and/or B.

1.3 Some Easy Examples

Introduction - Methods of Merging in R

There are various ways to merge data in R. Three methods, which mainly differ in speed and efficiency, are discussed in the video below.

Since speed, even in our vast data set of over 2 million cases, is not really an issue, we will use the functions in base R.

Example 1

OK, let's take an easy example that is similar, in structure, to what we need to do with the real data set later on.

library(foreign)
(x1 <- read.dta("x1_12.dta"))

(x2 <- read.dta("x2_12.dta"))

Some notes:

Reading data from STATA files uses the function read.dta(). STATA files can be recognized by the extension .dta. Unless you STATA installed on your computer, you cannot open and read .dta files directly.
The read.dta() function is part of the foreign package. This package is part of the system library, which means that you do not have to install it. However, you do need to invoke using the library() statement.
If you obtain data stored in STATA from someone else, then ask him or her to save the data in one of the older versions of STATA, as read.dta() cannot (yet) read the latest formats of STATA.
There are various options in the read.dta() functions. Here, we stick to the default settings. More information can be found by clicking on the foreign package, under the Packages tab in RStudio.

The data in x1_12.dta have information on three organizations (id = 1..3), with various instances (e.g., shipments) per organization.

x1_12.dta stores the postal codes of each of these three organizations. The postal codes are fictitious 4-digit postal codes (1111, for organization 1, and so on).

We now want to add the 4-digit postal codes to our rectangular data on shipments, so that we know from where (or from which postal codes) the goods are sent.

It's good practice to have the key information used for merging stored in variables of the same name (here, id). Some software programs cannot even merge data if a common variable is absent. R, however, is more flexible.

By default, merge() uses a left-join. The first data set (x1) is the main data set, to which we want to add information from a look-up table (x2). In our case, the look-up table has the same values for id as the main data set, but in principle it can contain many more organizations and postal codes in which we are not interested for the task at hand.

# merge two data frames by (left-join)
(total <- merge(x1,x2,by="id"))

As you can see, for each and every occurrence of an organization, the postal code of that organization is now added to the data frame total.

For the sake of illustration, let's see what happens when we ask for a full-join rather than the default left-join.

We read data from x2e_12.dta, an extended version of x2_12.dta with data on organizations 4 and 5 which do not occur in our main data set.

(x2e <- read.dta("x2e_12.dta"))

leftjoin <- merge(x1,x2e,by="id"); leftjoin

fulljoin <- merge(x1,x2e,by="id", all.y=T); fulljoin

Example 2

In this example, we first read a data file linking senders to receivers.

Each of the organizations (coded 1..9) can act as a sender and a receiver - but, obviously, not in one and the same shipment. They are not sending stuff to themselves!

The data is stored in sr.dta. Who is sending to whom, can be checked in a cross tabulation. The base R function is table(), but the CrossTable() function from the gmodels package is more informative. You can copy it, and run it in your own script.

Since the diagonal in the table contains zeroes only, indeed there are no shipments where the sender and receiver are the same. So, that's good.

sr <- read.dta("sr.dta"); sr

summary(sr)

##      sender       receiver  
##  Min.   :1.0   Min.   :1.0  
##  1st Qu.:2.0   1st Qu.:2.0  
##  Median :3.5   Median :4.0  
##  Mean   :4.0   Mean   :4.4  
##  3rd Qu.:6.0   3rd Qu.:7.0  
##  Max.   :9.0   Max.   :9.0

table(sr$receiver,sr$sender)

##    
##     1 2 3 4 5 6 7 8 9
##   1 0 1 0 2 0 0 1 0 0
##   2 1 0 0 0 0 0 1 1 1
##   3 1 0 0 0 0 0 0 0 0
##   4 0 1 1 0 0 0 0 0 0
##   5 0 0 1 0 0 0 0 0 0
##   6 0 1 0 0 1 0 0 0 0
##   7 1 0 0 0 0 1 0 0 0
##   8 1 0 0 0 0 1 0 0 0
##   9 0 1 0 0 0 1 0 0 0

# 2-Way Cross Tabulation
# install.packages("gmodels")
# library(gmodels)
# CrossTable(sr$receiver,sr$sender)

We now read the zipcodes of the 9 organizations.

zip_sr <- read.dta("zip.dta"); zip_sr

Next we merge it with the shipment data, in sr.

First, we add the zipcodes to the senders.

Since the names of the columns differ, we have to specify them. We use by.x to indicate the name of the key variable in the master file, and by.y for the corresponding name of the variable in the look-up table.

Information in any organizations appearing in the look-up table but not in the master file, is not of interest for our analysis. It would be better to use all.y=F. Since there are no such cases, we could leave out the all.y option altogether (the default is all.y=F). Setting it to T is of no consequence. Check it for yourself!

Note that we use T and F, for TRUE and FALSE. This is common practice, but also bad practice! T and F can, in principle, be the names of objects you have created, and lead to wrong outcomes. Shame on us 😈

We have no postal code for organization 9. In the merged file, NA appears (for Not Available).

sr_szip <- merge(sr,zip_sr,by.x="sender",by.y="id",all.x=T,all.y=T);
head(sr_szip);tail(sr_szip)

We have to do the same for receiver.

A tiny problem is that in the merged data, we already have a variable called pc4, which is the postal code of the sender. It's always good to provide meaningful names to the variables in the data set. Let's give a new name to pc4.

We use the rename() command from the reshape package. We overwrite the old name with the new name. pc4s is the new name for the postal code of the sender.

# rename 
library(reshape)

## Warning: package 'reshape' was built under R version 4.0.3

sr_szip <- rename(sr_szip, c(pc4="pc4s"))

OK, next we add the postal codes of the receivers to the data set.

In the by.x we indicate receiver, and the rest is as above. We store the new information in a new object, sr_srzip, which reads like "the sender and the receiver, with postal codes on both added".

We print the first and last six rows of the data frame, using the head() and tail() commands. We rename pc4 to pc4r, the postal code of the receiver.

sr_srzip <- merge(sr_szip,zip_sr,by.x="receiver",by.y="id",all.x=T,all.y=T);
library(reshape)
sr_srzip <- rename(sr_srzip, c(pc4="pc4r"))
head(sr_srzip);tail(sr_srzip)

By itself, the order of variables and observations is of no consequence to the analyses we wish to perform. But, in order not to lose oversight, many analysts feel more comfortable to have the data sorted neatly.

When we started out with sr, the data was sorted by first sender, followed by receiver. Sender appeared first in the data frame. After the merging operations, you note that the information is now all over the place.

Let's fix it!

You can store your preferred order of variables (or columns), in an object (here: col_order). If you have forgotten the names of the columns, you can retrieve them with the names() command.

In order to avoid adding the name of the data frame, we use attach(). R then understands that the variables (e.g., sender, pc4s, ..) are part of the data frame sr_srzip.

We then sort the data by sender and receiver, and put the variables in the desired columns order.

names(sr_srzip)

## [1] "receiver" "sender"   "pc4s"     "pc4r"

col_order <- c("sender","pc4s","receiver","pc4r")
attach(sr_srzip)
sr_srzip <- sr_srzip[order(sender,receiver),col_order]
sr_srzip

Looks OK?

Example 3

We are already well on our way. But we are not there yet!

Obviously, the postal codes are of interest for all kinds of administrative and operational purposes. The truck drivers need address information, to plan their routes. The financial administration has to send invoices to the right addresses. And so on.

However, for our analyses, we are interested in the geographical distances between the locations of the sender and the receiver. Distances "as the crow flies" can be used to estimate driving times and traveling distances.

The coordinates of the postal codes, in some arbitrary unit, are depicted below.

The x and y coordinates of postal code 1111 are, for example, (1,1). The x and y coordinates of postal code 3333 are (3,4).

Using the Pythagorean Theorem, we can compute the straight line distance from 1111 to 2222.

\(c^2 = a^2 + b^2\) → c = √(a² + b²)

The straight line distance between 1111 and 3333 then equals:

c = √((1-3)² + (1-4)²) = √(4+9) = √13 = 3.61

1.4 Assignment

OK, here's your assignment.

The coordinates of the postal codes (1111 to 9999), are in zipcoor.dta. Read the data.
Merge the coordinates to the postal codes of the receiver and the sender.
Compute the straight line distances between the senders and receivers. Add the variable containing this information to the data frame.

Solution

Below, you find a solution. There are many ways to do it, of course.

## Assignment
## -1- The coordinates of the postal codes (1111 to 9999), are in **zipcoor.dta**. Read the data.  
## -2- Merge the coordinates to the postal codes of the receiver and the sender.  
## -3- Compute the straight line distances between the senders and receivers. Add the variable containing this information to the data frame.  

## Solution

rm(list=ls()) # We remove all objects in memory, and start afresh.

## -1-
library(foreign)
zipcoor <- read.dta("zipcoor.dta"); zipcoor

## -2-

## Run all previous code, to
sr <- read.dta("sr.dta")
zip_sr <- read.dta("zip.dta")
sr_szip <- merge(sr,zip_sr,by.x="sender",by.y="id",all.x=T,all.y=T);
library(reshape)
sr_szip <- rename(sr_szip, c(pc4="pc4s"))
sr_srzip <- merge(sr_szip,zip_sr,by.x="receiver",by.y="id",all.x=T)
sr_srzip <- rename(sr_srzip, c(pc4="pc4r"))
col_order <- c("sender","pc4s","receiver","pc4r")
sr_srzip <- sr_srzip[order(sr_srzip$sender,sr_srzip$receiver),col_order]
head(sr_srzip)

head(zipcoor)

## step (a): Merge coordinates to postal code sender

a <- merge(sr_srzip,zipcoor,by.x="pc4s",by.y="zip",all.x=T,all.y=F)
a <- rename(a, c(xcoor="xcs",ycoor="ycs"))
a

## step (b): Merge coordinates to postal code receiver

b <- merge(a,zipcoor,by.x="pc4r",by.y="zip",all.x=T,all.y=F)
b <- rename(b, c(xcoor="xcr",ycoor="ycr"))
b

## Nicely looking, sorted data ...

col_order <- c("sender","pc4s","xcs","ycs","receiver","pc4r","xcr","ycr")
sr_final <- b[order(b$sender,b$receiver),col_order]; head(sr_final)

## -3- 

attach(sr_final)

## The following objects are masked from sr_srzip:
## 
##     pc4r, pc4s, receiver, sender

sr_final$dist <- sqrt((xcs-xcr)^2 + (ycs-ycr)^2)
sr_final

detach(sr_final)

1.5 (A Sample of) The Real Data for the Project

We have upload the following files on https://github.com/ssmresearch/datasets.

source_all_sample2000_stata12.dta
pc432_lat_lon_stata12.dta
m3_stata12.dta

The first file (source_all_sample2000_stata12.dta) contains a sample of 2,000 records, which is less than 0.1% of the complete data set, but enough to develop and test your scripts before applying them to the full data set.

The second file (pc432_lat_lon_stata12.dta) contains (Dutch) postal codes, and their GPS coordinates, in latitude and longitude.

Dutch postal codes have a six-digit format ("1234AB"). The coordinates of postal codes have been supplied by a commercial provider. As an approximation, we have aggregated the coordinates at four, three and two digit levels.

Unfortunately, some of the postal codes in the real data set are not in the data sets on postal codes (probably because of errors; new or provisional postal codes in industrial areas; and so on). The idea is to match coordinates at the most detailed level available (four digits, max).

The postal codes are already determined in terms of sender ("from") and receiver ("to"). Flow types are outgoing ("out"), from wholesalers and distribution centers to retailers, or incoming ("in"), most likely return flows from retailers to wholesalers. In addition, there are flows from distribution centers to their branches (retailers).

The shipments have a packaging code. Actually, the file contains two codes: a two-digit alphanumeric code, like "PA" for pallets, and a numeric code (e.g., 4, for pallets). One of the codes is obviously redundant. Each packaging type can be converted to an (average) volume, in cubic metres (m³). The conversion factors are in the third file (m3_stata12.dta).

The first tasks are to read all data, in R.

Next, we want to merge the coordinates of the "to" and "from" postal codes, so that we can compute (straight, beeline) distances. The formula for deriving distances from latitude and longitude coordinates is a bit of a nightmare, which has to do with trigonometric computations, and the fact that the world is not flat. At least, most of us tend to think so.

The formula is:

\(distance = acos(sin(lat1r)*sin(lat2r)+cos(lat1r)*cos(lat2r)*cos(lon2r-lon1r))* 6371\)

In this formula, \(lat1r\) is the latitude in radians, computed as lat1r=lat1/(180/π) (π=3.71828...).

For the moment, just focus on merging the data. The formula will come later ...

Some tips when writing your script:

Create a directory for this project, where you store all files for the project.
Make this directory your working directory, using the setwd("pathname") command.

In the pathname, use either single forward slashes (/) or double backwardslashes (\).

The commands below both work. The pathname, of course, depends on how you have organized your computer, folders and directories.

setwd("C:\\06. FASHIONLOG")
setwd("C:/06. FASHIONLOG")

Use a lot of comments. It makes your script easy to tead for others, and for yourself!
Save your script regularly.

With regard to (3.), an elegant way to combine text and script, is to use R Markdown. This very document is written in R Markdown. Rather than opening a new script, you can open an R Markdown file. Flat, unformatted next, will do. However, an R Markdown cheat sheet will enable you to make nice documents. You can publish them, and share wih others, e.g. via RPubs.