The programming language R is designed to allow rapid and flexible calculations on large and complex data sets. Its efficiency and flexibility rests in part in the way it performs operations. R performs its operations on vectors–lists of values associated with a particular variable name (also known as a key). The values are ordered by an index number, where “c[5]” is the fifth value in the list of values that make up the vector “c”. Data frames are structures in R that include one or more vectors, aligned by their index numbers so that they look like a table. This structure makes it very easy within R to apply the same calculation to all elements in a particular vector, or across elements of two or more vectors.
To take full advantage of this vectorized processing (and create efficient code), we should arrange our data sets so they are “tidy”, in the sense meant by Hadley Wickham. This means that in the data set, each column represents a variable, each row represents a case or an observation, and each cell contains a single value that corresponds to one variable and one observation. Thus, moving down a column, each element represents the value of that variable for a specific case or observation in the data set. Moving across the rows, the elements represent the values of the different variables for the case or observation represented by that index number. Unfortunately, many data sets do not come to us in a tidy fashion. That may be because they were organized differently to begin with, or because we are combining different datasets with different formats into a single larger data frame. When that is the case, performing operations on the data sets can require convoluted code and many steps. Garrett Groleman has a nice overview of data tidying that illustrates the benefits of tidy data over untidy data. The two R packages tidyr and dplyr contain operations that make it much easier to convert untidy data into a tidy format. Brad Boehmke has a clear, concise overview of these functions with examples of their applications.
The R package tidyr includes a number of verbs (which might be thought of as commands or operations) that operate on datasets to make them tidy. These include gather()
, spread()
, separate()
, and unite()
. These commands are useful when data are messy in the following ways:
Sometimes, data sets are organized more for presentation than for analysis. In this case, rather than having columns be the names of variables and rows be observations, the rows and columns may both include values of variables. An example that Hadley Wickham uses is data on the relationship between income and religious preference in the U.S. In this case, each row represents a different religion, each column represents a range of income levels, and each cell contains the number of observations in the data set for that religion and that level of income. Wickham points out that this means that there are three variables in the table: Religion, Income, and Frequency, but they are not arranged in key-value pairs. In a key-value pair, the key states what the value is a value of. So, for example, “Buddhist” would be a value for which “Religion” would be a key.
The verb gather()
collects the values that exist in the table as column names, and places them all in a single column under the variable name that describes them. So in the example above, the command gather(pew, income, frequency, -religion)
collects the column headings, which are income ranges, and makes them values in a new single column called “income”. It repeats the values of Religion as many times as necessary to make enough rows to accommodate all of the variable combinations. The value of “frequency” that is associated with each combination of religion and income then is contained in a column with the name “frequency”. A table that consisted of 18 rows and 11 columns now consists of three columns–Religion, Income, and Frequency–and 180 rows.
The verb spread()
reverses this process. It will take key-value combinations that are in two columns and spread them across rows instead. Thus, spread(data, income, frequency)
will take the data frame “data”, convert the unique values in the column “income” into column headings, and assign the values of the variable “frequency” to those columns. This is especially useful when variables are stored in rows as well as columns. This happens when there is a column in the dataset that stores not values, but the names of variables. [Wickham’s overview][Wickham] uses the example of weather data where one of the columns in the dataset indicates the particular weather value being recorded (e.g., maximum temperature, minimum temperature, etc.). These weather elements are actually variable names, that are then associated with particular values in another column. The spread()
operation takes those variable names and creates separate columns for them, distributing the appropriate values to those columns. Brad Boehmke also has a nice illustration of this pair of operations in his overview on “data wrangling”
Sometimes, datasets will contain multiple variables in one column. This might be a month and a year, for example, such as “JUL-14.” The verb separate()
will split the column into two, separating the values at a particular point and assigning them to the appropriate columns. By default, separate()
will split the compound value at any sequence of non-alphanumeric values, but you can specify a particular character string as a dividing point, or a specific number of positions counteded from the left end of the string. So, for example, if we had a column with dates in the format MMM-YY, the command separate(data, date, month year, sep = "-")
would split the column into two separate ones, breaking the values at the hyphen. The verb unite()
reverses the process, combining the vaues in two (or more) different columns into compound values in one column.
Five of the seven operations in the package dplyr
operate on a single data frame, allowing one to reorder rows, create new variables by transforming existing ones, and choose and summarize subsets of the data. Two additional operations allow you to join two datasets that have values of different variables measured on some or all of the same cases or observations.
We’ll use the dataset nycflights13
, which contains information on all of the flights departing from any of the three New York City airports in 2013.
Get the data:
if (packageVersion("devtools") < 1.6) {
install.packages("devtools",repos="http://cran.rstudio.com")
}
if (!require("lazyeval")) {devtools::install_github("hadley/lazyeval")
library(lazyeval) }
library("datavizcourse")#, #lib.loc="/Library/Frameworks/R.framework/Versions/3.2/Resources/library")
library("dplyr")#, lib.loc="/Library/Frameworks/R.framework/Versions/3.2/Resources/library")
if (!require("nycflights13")) {install.packages("nycflights13",repos="http://cran.rstudio.com")
library(nycflights13)}
Make it easy to display:
tblflights <- tbl_df(flights)
head(tblflights,3) # Can also use print(tblflights,3) instead
## Source: local data frame [3 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
One of the simplest relationships to examine is a ranking or order in variable values. The command arrange()
will re-order the rows of a dataset according to the values of one or more variables. Rows are arranged by default in ascending order (according to numerical value of a quantitative variable or ascii value of a text variable) by the first variable specified. If additional variables are included in the call, these will be used in order to separate rows that are tied according to the previous variables. Rows can be arranged in descending order using the desc()
parameter.
For example, to arrange the flight data by departure delays, we can use the following:
arrange(tblflights, desc(dep_delay))
## Source: local data frame [336,776 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 9 641 1301 1242 1272 HA N384HA
## 2 2013 6 15 1432 1137 1607 1127 MQ N504MQ
## 3 2013 1 10 1121 1126 1239 1109 MQ N517MQ
## 4 2013 9 20 1139 1014 1457 1007 AA N338AA
## 5 2013 7 22 845 1005 1044 989 MQ N665MQ
## 6 2013 4 10 1100 960 1342 931 DL N959DL
## 7 2013 3 17 2321 911 135 915 DL N927DA
## 8 2013 6 27 959 899 1236 850 DL N3762Y
## 9 2013 7 22 2257 898 121 895 DL N6716C
## 10 2013 12 5 756 896 1058 878 AA N5DMAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
Sometimes, one wishes to create new variables which are functions of existing ones. The verb mutate()
adds new columns to the data frame and populates them with elements that are functions of existing variables. The previously existing variables are retained in the data frame along with the new ones. If you wish to remove the previous variables and keep only the transformed ones, the operation transmute()
will do that.
# prints the new variable but does not store it:
tblflights %>%
select(dep_delay, arr_delay) %>%
mutate(delaysquare = dep_delay^2 + arr_delay^2) %>%
head()
## Source: local data frame [6 x 3]
##
## dep_delay arr_delay delaysquare
## (dbl) (dbl) (dbl)
## 1 2 11 125
## 2 4 20 416
## 3 2 33 1093
## 4 -1 -18 325
## 5 -6 -25 661
## 6 -4 12 160
# store the new variable:
tblflights <- tblflights %>%
mutate(delaysquare = dep_delay^2 + arr_delay^2)
The commands filter()
and select()
are similar in that they allow you to choose only portions of the data to use in subsequent operations; the difference is how they subset the data. The filter()
operation collects all rows in the dataset that match specified criteria. The criteria can be specified by a Boolean expression and are usually sets of values of particluar variables in the dataset. So for example, in a dataset that included “Month” as a variable with values 1 - 12, the command filter(data, month < 6)
would select only those rows with data from the first six months. (A similar subsetting by rows can be accomplished with the command slice()
, which selects the rows by their numerical position in the dataset). The select()
command subsets the data by choosing specific columns by name. This is useful when the dataset contains a large number of variables, and you wish to do operations on only a few of them.
For example, we can filter()
the tblflights
data to show only American Airlines flights originating at LaGuardia Airport.
filter(tblflights, carrier=="AA" & origin=="LGA")
## Source: local data frame [15,459 x 17]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 1 558 -2 753 8 AA N3ALAA
## 2 2013 1 1 559 -1 941 31 AA N3DUAA
## 3 2013 1 1 623 13 920 5 AA N3EMAA
## 4 2013 1 1 629 -1 824 14 AA N3CYAA
## 5 2013 1 1 635 0 1028 48 AA N3GKAA
## 6 2013 1 1 656 -4 854 4 AA N4WNAA
## 7 2013 1 1 659 -1 1008 -7 AA N3EKAA
## 8 2013 1 1 724 -6 1111 31 AA N541AA
## 9 2013 1 1 739 -6 918 -12 AA N4WPAA
## 10 2013 1 1 753 -2 1056 -14 AA N3HMAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl), delaysquare (dbl)
# same as filter(tblflights, carrier=="AA", origin=="LGA")
The “&” character is a Boolean “and” join, which can also be indicated with a comma. A vertical bar (|) indicates an “or” join, so the following filters the dataset for only those records for American Airlines or United Airlines:
filter(tblflights, carrier=="AA" | carrier=="UA")
## Source: local data frame [91,394 x 17]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 554 -4 740 12 UA N39463
## 5 2013 1 1 558 -2 753 8 AA N3ALAA
## 6 2013 1 1 558 -2 924 7 UA N29129
## 7 2013 1 1 558 -2 923 -14 UA N53441
## 8 2013 1 1 559 -1 941 31 AA N3DUAA
## 9 2013 1 1 559 -1 854 -8 UA N76515
## 10 2013 1 1 606 -4 858 -12 AA N633AA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl), delaysquare (dbl)
To select only certain variables, use the select()
command. To create a subset of the data that includes only departure and arrival times and flight numbers, use the following:
print(select(tblflights, dep_time, arr_time, flight),n=6)
## Source: local data frame [336,776 x 3]
##
## dep_time arr_time flight
## (int) (int) (int)
## 1 517 830 1545
## 2 533 850 1714
## 3 542 923 1141
## 4 544 1004 725
## 5 554 812 461
## 6 554 740 1696
## .. ... ... ...
Other expressions can also be used to specify ranges of columns (separated by a colon), or any column whose name starts_with()
, ends_with()
, or matches()
a regular expression:
head(select(tblflights, flight:dest, contains("arr"), contains("dep")))
## Source: local data frame [6 x 8]
##
## flight origin dest arr_time arr_delay carrier dep_time dep_delay
## (int) (chr) (chr) (int) (dbl) (chr) (int) (dbl)
## 1 1545 EWR IAH 830 11 UA 517 2
## 2 1714 LGA IAH 850 20 UA 533 4
## 3 1141 JFK MIA 923 33 AA 542 2
## 4 725 JFK BQN 1004 -18 B6 544 -1
## 5 461 LGA ATL 812 -25 DL 554 -6
## 6 1696 EWR ORD 740 12 UA 554 -4
Note that the data set includes the column “carrier” as well as those including arrival information, because the name contains “arr”.
The verb summarize()
creates data frames that are made up of values that summarize those in the original data frame. It is most useful when it is performed on subsets of the data that have been grouped by meaningful criteria. Common applications are to get summary statistics such as means, numbers of observations, or maximum or minimum values for comparisons among the groups. In this application, the command group_by()
creates the groupings to be summarized, and summarize()
calculates the summary statistics. Multiple summary statistics can be calculated for each variable in the same summarize()
call, and multiple variables can be summarized at once using summarize_each()
.
We can apply this to the tblflights
data to calcluate the mean departure delay for flights by their airport of origin:
# Create a table grouped by origin, and then summarise each group by taking the mean of dep_delay
tblflights %>%
group_by(origin) %>%
summarise(avg_delay = mean(dep_delay, na.rm=TRUE)) %>%
head()
## Source: local data frame [3 x 2]
##
## origin avg_delay
## (chr) (dbl)
## 1 EWR 15.10795
## 2 JFK 12.11216
## 3 LGA 10.34688
To do this for multiple variables, use the summarize_each()
command:
# for each carrier, calculate the mean arrival and departure delays at the different origin airports
tblflights %>%
group_by(origin) %>%
summarise_each(funs(mean(.,na.rm=TRUE)), arr_delay, dep_delay) %>%
head()
## Source: local data frame [3 x 3]
##
## origin arr_delay dep_delay
## (chr) (dbl) (dbl)
## 1 EWR 9.107055 15.10795
## 2 JFK 5.551481 12.11216
## 3 LGA 5.783488 10.34688
Often, we do not have all of the data we want in a single dataset. This may be because data were collected at different times, or because different variables were measured on the same set of cases at different times. Combining these datasets is done using four different commands with the format xxx_join()
. All take the same format: xxx_join(x, y, by = NULL)
where x and y are the two dataframes to be joined, and “by =” specifies the variables to join by. If no variables are specified, all variables with names in common across the two dataframes will be included.
Let’s look at three of the nycflights13
datasets:
flights %>% head(3)
## Source: local data frame [3 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
airlines %>% head(3)
## Source: local data frame [3 x 2]
##
## carrier name
## (fctr) (fctr)
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
airports %>% head(3)
## Source: local data frame [3 x 7]
##
## faa name lat lon alt tz dst
## (chr) (chr) (dbl) (dbl) (int) (dbl) (chr)
## 1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A
## 2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -5 A
## 3 06C Schaumburg Regional 41.98934 -88.10124 801 -6 A
This includes only rows of x and y that have values in both.
flights %>% inner_join(airlines) %>% head(3)
## Joining by: "carrier"
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
## Source: local data frame [3 x 17]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl), name (fctr)
flights %>% inner_join(airlines) %>% select(distance:name) %>% head(3)
## Joining by: "carrier"
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
## Source: local data frame [3 x 4]
##
## distance hour minute name
## (dbl) (dbl) (dbl) (fctr)
## 1 1400 5 17 United Air Lines Inc.
## 2 1416 5 33 United Air Lines Inc.
## 3 1089 5 42 American Airlines Inc.
This keeps all rows in x that have a match in y, but keeps only the variables (columns) from x. This essentially subsets the data frame x to those observations that also occur in y.
airports %>% left_join(faaflights) %>% head(2)
## Joining by: "faa"
## Source: local data frame [2 x 8]
##
## faa name lat lon alt tz dst
## (chr) (chr) (dbl) (dbl) (int) (dbl) (chr)
## 1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A
## 2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -5 A
## Variables not shown: origin (chr)
airports %>% semi_join(faaflights) %>% head(2)
## Joining by: "faa"
## Source: local data frame [2 x 7]
##
## faa name lat lon alt tz dst
## (chr) (chr) (dbl) (dbl) (int) (dbl) (chr)
## 1 EWR Newark Liberty Intl 40.69250 -74.16867 18 -5 A
## 2 LGA La Guardia 40.77725 -73.87261 22 -5 A
This is the opposite of a semi-join. It keeps only the rows in x that do not have a matching value in y, and keeps only the columns from x.
faaflights %>% anti_join(airports)
## Joining by: "faa"
## Source: local data frame [0 x 2]
##
## Variables not shown: origin (chr), faa (chr)
airports %>% anti_join(faaflights)
## Joining by: "faa"
## Source: local data frame [1,394 x 7]
##
## faa name lat lon alt tz dst
## (chr) (chr) (dbl) (dbl) (int) (dbl) (chr)
## 1 ZYP Penn Station 40.75050 -73.99350 35 -5 A
## 2 ZWU Washington Union Station 38.89746 -77.00643 76 -5 A
## 3 ZWI Wilmington Amtrak Station 39.73667 -75.55167 0 -5 A
## 4 ZVE New Haven Rail Station 41.29867 -72.92599 7 -5 A
## 5 ZUN Black Rock 35.08323 -108.79178 6454 -7 A
## 6 ZTY Boston Back Bay Station 42.34780 -71.07500 20 -5 A
## 7 ZTF Stamford Amtrak Station 41.04694 -73.54149 0 -5 A
## 8 ZSY Scottsdale Airport 33.62289 -111.91053 1519 -7 A
## 9 ZSF Springfield Amtrak Station 42.10600 -72.59305 65 -5 A
## 10 ZRZ New Carrollton Rail Station 38.94800 -76.87190 39 -5 A
## .. ... ... ... ... ... ... ...
The purpose of tidying data is to make efficient use of R’s processing, which means being able to accomplish the desired steps with a relatively small amount of code. This makes for “tidy” coding. An additional way to make code efficient and readable is to use “pipes” to pass results of one process as arguments for the next. Stefan Milton Brache created the pipe %>%
for this purpose, and it is very useful in combination with the commands in tidyr
and dplyr
. The basic syntax is that %>%
passes the value or vector that precedes it to the function or expression that follows it. This avoids the use of nested parentheses that can be hard to follow (and type accurately), or the creation of multiple intermediate objects that hold values temporarily. Boehmke gives the following example:
### Nested Option:
arrange(
summarize(
filter(data, variable == numeric_value),
Total = sum(variable)
),
desc(Total)
)
### Multiple Object Option:
a <- filter(data, variable == numeric_value)
b <- summarise(a, Total = sum(variable))
c <- arrange(b, desc(Total))
### %>% Option:
data %>%
filter(variable == “value”) %>%
summarise(Total = sum(variable)) %>%
arrange(desc(Total))
With the commands available in the tidyr
and dplyr
packages, linked efficiently with pipes, one can produce data frames that are ready to make maximum use of the processing power of R.