The programming language R is designed to allow rapid and flexible calculations on large and complex data sets. Its efficiency and flexibility rests in part in the way it performs operations. R performs its operations on vectors–lists of values associated with a particular variable name (also known as a key). The values are ordered by an index number, where “c[5]” is the fifth value in the list of values that make up the vector “c”. Data frames are structures in R that include one or more vectors, aligned by their index numbers so that they look like a table. This structure makes it very easy within R to apply the same calculation to all elements in a particular vector, or across elements of two or more vectors.

To take full advantage of this vectorized processing (and create efficient code), we should arrange our data sets so they are “tidy”, in the sense meant by Hadley Wickham. This means that in the data set, each column represents a variable, each row represents a case or an observation, and each cell contains a single value that corresponds to one variable and one observation. Thus, moving down a column, each element represents the value of that variable for a specific case or observation in the data set. Moving across the rows, the elements represent the values of the different variables for the case or observation represented by that index number. Unfortunately, many data sets do not come to us in a tidy fashion. That may be because they were organized differently to begin with, or because we are combining different datasets with different formats into a single larger data frame. When that is the case, performing operations on the data sets can require convoluted code and many steps. Garrett Groleman has a nice overview of data tidying that illustrates the benefits of tidy data over untidy data. The two R packages tidyr and dplyr contain operations that make it much easier to convert untidy data into a tidy format. Brad Boehmke has a clear, concise overview of these functions with examples of their applications.

tidyr

The R package tidyr includes a number of verbs (which might be thought of as commands or operations) that operate on datasets to make them tidy. These include gather(), spread(), separate(), and unite(). These commands are useful when data are messy in the following ways:

gather() and spread()

Sometimes, data sets are organized more for presentation than for analysis. In this case, rather than having columns be the names of variables and rows be observations, the rows and columns may both include values of variables. An example that Hadley Wickham uses is data on the relationship between income and religious preference in the U.S. In this case, each row represents a different religion, each column represents a range of income levels, and each cell contains the number of observations in the data set for that religion and that level of income. Wickham points out that this means that there are three variables in the table: Religion, Income, and Frequency, but they are not arranged in key-value pairs. In a key-value pair, the key states what the value is a value of. So, for example, “Buddhist” would be a value for which “Religion” would be a key.

The verb gather() collects the values that exist in the table as column names, and places them all in a single column under the variable name that describes them. So in the example above, the command gather(pew, income, frequency, -religion) collects the column headings, which are income ranges, and makes them values in a new single column called “income”. It repeats the values of Religion as many times as necessary to make enough rows to accommodate all of the variable combinations. The value of “frequency” that is associated with each combination of religion and income then is contained in a column with the name “frequency”. A table that consisted of 18 rows and 11 columns now consists of three columns–Religion, Income, and Frequency–and 180 rows.

The verb spread() reverses this process. It will take key-value combinations that are in two columns and spread them across rows instead. Thus, spread(data, income, frequency) will take the data frame “data”, convert the unique values in the column “income” into column headings, and assign the values of the variable “frequency” to those columns. This is especially useful when variables are stored in rows as well as columns. This happens when there is a column in the dataset that stores not values, but the names of variables. [Wickham’s overview][Wickham] uses the example of weather data where one of the columns in the dataset indicates the particular weather value being recorded (e.g., maximum temperature, minimum temperature, etc.). These weather elements are actually variable names, that are then associated with particular values in another column. The spread() operation takes those variable names and creates separate columns for them, distributing the appropriate values to those columns. Brad Boehmke also has a nice illustration of this pair of operations in his overview on “data wrangling”

separate() and unite()

Sometimes, datasets will contain multiple variables in one column. This might be a month and a year, for example, such as “JUL-14.” The verb separate() will split the column into two, separating the values at a particular point and assigning them to the appropriate columns. By default, separate() will split the compound value at any sequence of non-alphanumeric values, but you can specify a particular character string as a dividing point, or a specific number of positions counteded from the left end of the string. So, for example, if we had a column with dates in the format MMM-YY, the command separate(data, date, month year, sep = "-") would split the column into two separate ones, breaking the values at the hyphen. The verb unite() reverses the process, combining the vaues in two (or more) different columns into compound values in one column.

dplyr

Five of the seven operations in the package dplyr operate on a single data frame, allowing one to reorder rows, create new variables by transforming existing ones, and choose and summarize subsets of the data. Two additional operations allow you to join two datasets that have values of different variables measured on some or all of the same cases or observations.

Some data to work on:

We’ll use the dataset nycflights13, which contains information on all of the flights departing from any of the three New York City airports in 2013.

Get the data:

if (packageVersion("devtools") < 1.6) {
  install.packages("devtools",repos="http://cran.rstudio.com")
}
if (!require("lazyeval")) {devtools::install_github("hadley/lazyeval")
library(lazyeval)  }
library("datavizcourse")#, #lib.loc="/Library/Frameworks/R.framework/Versions/3.2/Resources/library")
library("dplyr")#, lib.loc="/Library/Frameworks/R.framework/Versions/3.2/Resources/library")
if (!require("nycflights13")) {install.packages("nycflights13",repos="http://cran.rstudio.com")
library(nycflights13)}

Make it easy to display:

tblflights <- tbl_df(flights)
head(tblflights,3) # Can also use print(tblflights,3) instead
## Source: local data frame [3 x 16]
## 
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##   (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1  2013     1     1      517         2      830        11      UA  N14228
## 2  2013     1     1      533         4      850        20      UA  N24211
## 3  2013     1     1      542         2      923        33      AA  N619AA
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

arrange() – reordering rows of data

One of the simplest relationships to examine is a ranking or order in variable values. The command arrange() will re-order the rows of a dataset according to the values of one or more variables. Rows are arranged by default in ascending order (according to numerical value of a quantitative variable or ascii value of a text variable) by the first variable specified. If additional variables are included in the call, these will be used in order to separate rows that are tied according to the previous variables. Rows can be arranged in descending order using the desc() parameter.

For example, to arrange the flight data by departure delays, we can use the following:

arrange(tblflights, desc(dep_delay))
## Source: local data frame [336,776 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2013     1     9      641      1301     1242      1272      HA  N384HA
## 2   2013     6    15     1432      1137     1607      1127      MQ  N504MQ
## 3   2013     1    10     1121      1126     1239      1109      MQ  N517MQ
## 4   2013     9    20     1139      1014     1457      1007      AA  N338AA
## 5   2013     7    22      845      1005     1044       989      MQ  N665MQ
## 6   2013     4    10     1100       960     1342       931      DL  N959DL
## 7   2013     3    17     2321       911      135       915      DL  N927DA
## 8   2013     6    27      959       899     1236       850      DL  N3762Y
## 9   2013     7    22     2257       898      121       895      DL  N6716C
## 10  2013    12     5      756       896     1058       878      AA  N5DMAA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

mutate() – transforming variables to create new ones

Sometimes, one wishes to create new variables which are functions of existing ones. The verb mutate() adds new columns to the data frame and populates them with elements that are functions of existing variables. The previously existing variables are retained in the data frame along with the new ones. If you wish to remove the previous variables and keep only the transformed ones, the operation transmute() will do that.

# prints the new variable but does not store it:
tblflights %>%
    select(dep_delay, arr_delay) %>%
    mutate(delaysquare = dep_delay^2 + arr_delay^2) %>%
    head()
## Source: local data frame [6 x 3]
## 
##   dep_delay arr_delay delaysquare
##       (dbl)     (dbl)       (dbl)
## 1         2        11         125
## 2         4        20         416
## 3         2        33        1093
## 4        -1       -18         325
## 5        -6       -25         661
## 6        -4        12         160
# store the new variable: 
tblflights <- tblflights %>% 
    mutate(delaysquare = dep_delay^2 + arr_delay^2) 

filter() and select() – choosing subsets of the data

The commands filter() and select() are similar in that they allow you to choose only portions of the data to use in subsequent operations; the difference is how they subset the data. The filter() operation collects all rows in the dataset that match specified criteria. The criteria can be specified by a Boolean expression and are usually sets of values of particluar variables in the dataset. So for example, in a dataset that included “Month” as a variable with values 1 - 12, the command filter(data, month < 6) would select only those rows with data from the first six months. (A similar subsetting by rows can be accomplished with the command slice(), which selects the rows by their numerical position in the dataset). The select() command subsets the data by choosing specific columns by name. This is useful when the dataset contains a large number of variables, and you wish to do operations on only a few of them.

For example, we can filter() the tblflights data to show only American Airlines flights originating at LaGuardia Airport.

filter(tblflights, carrier=="AA" & origin=="LGA")
## Source: local data frame [15,459 x 17]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2013     1     1      558        -2      753         8      AA  N3ALAA
## 2   2013     1     1      559        -1      941        31      AA  N3DUAA
## 3   2013     1     1      623        13      920         5      AA  N3EMAA
## 4   2013     1     1      629        -1      824        14      AA  N3CYAA
## 5   2013     1     1      635         0     1028        48      AA  N3GKAA
## 6   2013     1     1      656        -4      854         4      AA  N4WNAA
## 7   2013     1     1      659        -1     1008        -7      AA  N3EKAA
## 8   2013     1     1      724        -6     1111        31      AA  N541AA
## 9   2013     1     1      739        -6      918       -12      AA  N4WPAA
## 10  2013     1     1      753        -2     1056       -14      AA  N3HMAA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl), delaysquare (dbl)
# same as filter(tblflights, carrier=="AA", origin=="LGA")

The “&” character is a Boolean “and” join, which can also be indicated with a comma. A vertical bar (|) indicates an “or” join, so the following filters the dataset for only those records for American Airlines or United Airlines:

filter(tblflights, carrier=="AA" | carrier=="UA") 
## Source: local data frame [91,394 x 17]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2013     1     1      517         2      830        11      UA  N14228
## 2   2013     1     1      533         4      850        20      UA  N24211
## 3   2013     1     1      542         2      923        33      AA  N619AA
## 4   2013     1     1      554        -4      740        12      UA  N39463
## 5   2013     1     1      558        -2      753         8      AA  N3ALAA
## 6   2013     1     1      558        -2      924         7      UA  N29129
## 7   2013     1     1      558        -2      923       -14      UA  N53441
## 8   2013     1     1      559        -1      941        31      AA  N3DUAA
## 9   2013     1     1      559        -1      854        -8      UA  N76515
## 10  2013     1     1      606        -4      858       -12      AA  N633AA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl), delaysquare (dbl)

To select only certain variables, use the select() command. To create a subset of the data that includes only departure and arrival times and flight numbers, use the following:

print(select(tblflights, dep_time, arr_time, flight),n=6)
## Source: local data frame [336,776 x 3]
## 
##    dep_time arr_time flight
##       (int)    (int)  (int)
## 1       517      830   1545
## 2       533      850   1714
## 3       542      923   1141
## 4       544     1004    725
## 5       554      812    461
## 6       554      740   1696
## ..      ...      ...    ...

Other expressions can also be used to specify ranges of columns (separated by a colon), or any column whose name starts_with(), ends_with(), or matches() a regular expression:

head(select(tblflights, flight:dest, contains("arr"), contains("dep")))
## Source: local data frame [6 x 8]
## 
##   flight origin  dest arr_time arr_delay carrier dep_time dep_delay
##    (int)  (chr) (chr)    (int)     (dbl)   (chr)    (int)     (dbl)
## 1   1545    EWR   IAH      830        11      UA      517         2
## 2   1714    LGA   IAH      850        20      UA      533         4
## 3   1141    JFK   MIA      923        33      AA      542         2
## 4    725    JFK   BQN     1004       -18      B6      544        -1
## 5    461    LGA   ATL      812       -25      DL      554        -6
## 6   1696    EWR   ORD      740        12      UA      554        -4

Note that the data set includes the column “carrier” as well as those including arrival information, because the name contains “arr”.

summarize() (or summarise()) – collapsing data into summary statistics

The verb summarize() creates data frames that are made up of values that summarize those in the original data frame. It is most useful when it is performed on subsets of the data that have been grouped by meaningful criteria. Common applications are to get summary statistics such as means, numbers of observations, or maximum or minimum values for comparisons among the groups. In this application, the command group_by() creates the groupings to be summarized, and summarize() calculates the summary statistics. Multiple summary statistics can be calculated for each variable in the same summarize() call, and multiple variables can be summarized at once using summarize_each().

We can apply this to the tblflights data to calcluate the mean departure delay for flights by their airport of origin:

# Create a table grouped by origin, and then summarise each group by taking the mean of dep_delay
tblflights %>%
    group_by(origin) %>%
    summarise(avg_delay = mean(dep_delay, na.rm=TRUE)) %>% 
    head()
## Source: local data frame [3 x 2]
## 
##   origin avg_delay
##    (chr)     (dbl)
## 1    EWR  15.10795
## 2    JFK  12.11216
## 3    LGA  10.34688

To do this for multiple variables, use the summarize_each() command:

# for each carrier, calculate the mean arrival and departure delays at the different origin airports
tblflights %>%
    group_by(origin) %>%
    summarise_each(funs(mean(.,na.rm=TRUE)), arr_delay, dep_delay) %>%
    head()
## Source: local data frame [3 x 3]
## 
##   origin arr_delay dep_delay
##    (chr)     (dbl)     (dbl)
## 1    EWR  9.107055  15.10795
## 2    JFK  5.551481  12.11216
## 3    LGA  5.783488  10.34688

Joining separate datasets with join()

Often, we do not have all of the data we want in a single dataset. This may be because data were collected at different times, or because different variables were measured on the same set of cases at different times. Combining these datasets is done using four different commands with the format xxx_join(). All take the same format: xxx_join(x, y, by = NULL) where x and y are the two dataframes to be joined, and “by =” specifies the variables to join by. If no variables are specified, all variables with names in common across the two dataframes will be included.

Let’s look at three of the nycflights13 datasets:

flights %>% head(3) 
## Source: local data frame [3 x 16]
## 
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##   (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1  2013     1     1      517         2      830        11      UA  N14228
## 2  2013     1     1      533         4      850        20      UA  N24211
## 3  2013     1     1      542         2      923        33      AA  N619AA
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)
airlines %>% head(3)
## Source: local data frame [3 x 2]
## 
##   carrier                   name
##    (fctr)                 (fctr)
## 1      9E      Endeavor Air Inc.
## 2      AA American Airlines Inc.
## 3      AS   Alaska Airlines Inc.
airports %>% head(3)
## Source: local data frame [3 x 7]
## 
##     faa                          name      lat       lon   alt    tz   dst
##   (chr)                         (chr)    (dbl)     (dbl) (int) (dbl) (chr)
## 1   04G             Lansdowne Airport 41.13047 -80.61958  1044    -5     A
## 2   06A Moton Field Municipal Airport 32.46057 -85.68003   264    -5     A
## 3   06C           Schaumburg Regional 41.98934 -88.10124   801    -6     A

The different types of joins are as follows:

inner_join(x,y): matching x + y

This includes only rows of x and y that have values in both.

flights %>% inner_join(airlines) %>% head(3)
## Joining by: "carrier"
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
## Source: local data frame [3 x 17]
## 
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##   (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1  2013     1     1      517         2      830        11      UA  N14228
## 2  2013     1     1      533         4      850        20      UA  N24211
## 3  2013     1     1      542         2      923        33      AA  N619AA
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl), name (fctr)
flights %>% inner_join(airlines) %>% select(distance:name) %>% head(3)
## Joining by: "carrier"
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
## Source: local data frame [3 x 4]
## 
##   distance  hour minute                   name
##      (dbl) (dbl)  (dbl)                 (fctr)
## 1     1400     5     17  United Air Lines Inc.
## 2     1416     5     33  United Air Lines Inc.
## 3     1089     5     42 American Airlines Inc.

semi_join(x, y) : all x with match in y

This keeps all rows in x that have a match in y, but keeps only the variables (columns) from x. This essentially subsets the data frame x to those observations that also occur in y.

airports %>% left_join(faaflights) %>% head(2)
## Joining by: "faa"
## Source: local data frame [2 x 8]
## 
##     faa                          name      lat       lon   alt    tz   dst
##   (chr)                         (chr)    (dbl)     (dbl) (int) (dbl) (chr)
## 1   04G             Lansdowne Airport 41.13047 -80.61958  1044    -5     A
## 2   06A Moton Field Municipal Airport 32.46057 -85.68003   264    -5     A
## Variables not shown: origin (chr)
airports %>% semi_join(faaflights) %>% head(2)
## Joining by: "faa"
## Source: local data frame [2 x 7]
## 
##     faa                name      lat       lon   alt    tz   dst
##   (chr)               (chr)    (dbl)     (dbl) (int) (dbl) (chr)
## 1   EWR Newark Liberty Intl 40.69250 -74.16867    18    -5     A
## 2   LGA          La Guardia 40.77725 -73.87261    22    -5     A

anti_join(x, y) : all x without match in y

This is the opposite of a semi-join. It keeps only the rows in x that do not have a matching value in y, and keeps only the columns from x.

faaflights %>% anti_join(airports)
## Joining by: "faa"
## Source: local data frame [0 x 2]
## 
## Variables not shown: origin (chr), faa (chr)
airports %>% anti_join(faaflights)
## Joining by: "faa"
## Source: local data frame [1,394 x 7]
## 
##      faa                        name      lat        lon   alt    tz   dst
##    (chr)                       (chr)    (dbl)      (dbl) (int) (dbl) (chr)
## 1    ZYP                Penn Station 40.75050  -73.99350    35    -5     A
## 2    ZWU    Washington Union Station 38.89746  -77.00643    76    -5     A
## 3    ZWI   Wilmington Amtrak Station 39.73667  -75.55167     0    -5     A
## 4    ZVE      New Haven Rail Station 41.29867  -72.92599     7    -5     A
## 5    ZUN                  Black Rock 35.08323 -108.79178  6454    -7     A
## 6    ZTY     Boston Back Bay Station 42.34780  -71.07500    20    -5     A
## 7    ZTF     Stamford Amtrak Station 41.04694  -73.54149     0    -5     A
## 8    ZSY          Scottsdale Airport 33.62289 -111.91053  1519    -7     A
## 9    ZSF  Springfield Amtrak Station 42.10600  -72.59305    65    -5     A
## 10   ZRZ New Carrollton Rail Station 38.94800  -76.87190    39    -5     A
## ..   ...                         ...      ...        ...   ...   ...   ...

Keeping your code tidy with pipes: %>%

The purpose of tidying data is to make efficient use of R’s processing, which means being able to accomplish the desired steps with a relatively small amount of code. This makes for “tidy” coding. An additional way to make code efficient and readable is to use “pipes” to pass results of one process as arguments for the next. Stefan Milton Brache created the pipe %>% for this purpose, and it is very useful in combination with the commands in tidyr and dplyr. The basic syntax is that %>% passes the value or vector that precedes it to the function or expression that follows it. This avoids the use of nested parentheses that can be hard to follow (and type accurately), or the creation of multiple intermediate objects that hold values temporarily. Boehmke gives the following example:

### Nested Option:

    arrange(
            summarize(
                filter(data, variable == numeric_value),
                Total = sum(variable)
            ),
        desc(Total)
    )

### Multiple Object Option:

     a <- filter(data, variable == numeric_value)
     b <- summarise(a, Total = sum(variable))
     c <- arrange(b, desc(Total))

### %>% Option:

     data %>%
            filter(variable == “value”) %>%
            summarise(Total = sum(variable)) %>%
            arrange(desc(Total))

With the commands available in the tidyr and dplyr packages, linked efficiently with pipes, one can produce data frames that are ready to make maximum use of the processing power of R.