It is rare that you get the data in exactly the right form you need. Often you’ll need to manipulate your data and get ready for your analysis. In this chapter we will learn how to do this using dplyr package and a new dataset on flights departing New York City in 2013.
library(dplyr); library(ggplot2); library(nycflights13)To explore the basic data manipulation verbs of dplyr, we’ll use nycflights13::flights. This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.
flights## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <time>
The basic dplyr functions are:
filter() and slice().arrange().select() and rename().mutate() and transmute().summarise().All of them can be combined with group_by() to work on subsets of variables instead of their entirety.
Let’s dive in and see how these verbs work.
filter() subsets observations based on logical conditions. The first argument is the data frame. The second and subsequent arguments are logical expressions that filter the data frame.
Example: select all flights on January 1st:
filter(flights, month == 1 & day == 1)## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <time>
Dplyr returns a new filtered data frame. To save the result, use the assignment operator, <-:
jan1 <- filter(flights, month==1 & day==1)In the previous example we used the logical and operator, &, to filter the rows. This means: select the rwos where both conditions are true.
Other logical operators are:
Or, |: select rows where any of the conditions is true
Not, !: select rows where the condition is not true
Group membership %in%: select rows that match a specific group of values of a column
Example:
Find all flights that departed in November or December
Solution:
(dec.or.nov <- filter(flights, month == 11 | month == 12))## # A tibble: 55,403 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 11 1 5 2359 6 352
## 2 2013 11 1 35 2250 105 123
## 3 2013 11 1 455 500 -5 641
## 4 2013 11 1 539 545 -6 856
## 5 2013 11 1 542 545 -3 831
## 6 2013 11 1 549 600 -11 912
## 7 2013 11 1 550 600 -10 705
## 8 2013 11 1 554 600 -6 659
## 9 2013 11 1 554 600 -6 826
## 10 2013 11 1 554 600 -6 749
## # ... with 55,393 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <time>
To see that it selected only the flights in November or December:
unique(dec.or.nov$month)## [1] 11 12
Note: when using & or |, you have to repeat the name of the variable like we did: month == 11 | month == 12, not: month == 11 | 12.
Note: to check for equality, we use the double equal operator: ==, not the single one: =.
Example: Find all flights that departed from any airport other than ‘JFK’
no_jfk <- filter(flights, origin != 'JFK')
# check if no_jfk contains 'JFK' in the `origin`
'JFK' %in% no_jfk$origin## [1] FALSE
Example: Find all flights that belong to ‘UA’, ‘B6’, ‘DL’ airlines, i.e: carrier
filter(flights, carrier %in% c('UA', 'B6', 'DL'))## # A tibble: 161,410 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 544 545 -1 1004
## 4 2013 1 1 554 600 -6 812
## 5 2013 1 1 554 558 -4 740
## 6 2013 1 1 555 600 -5 913
## 7 2013 1 1 557 600 -3 838
## 8 2013 1 1 558 600 -2 849
## 9 2013 1 1 558 600 -2 853
## 10 2013 1 1 558 600 -2 924
## # ... with 161,400 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <time>
# note: you could also use the logical OR operator `|` here. slice() subsets rows based on their position (index).Example: Get the 2nd five rows
slice(flights, 6:10)## # A tibble: 5 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 554 558 -4 740
## 2 2013 1 1 555 600 -5 913
## 3 2013 1 1 557 600 -3 709
## 4 2013 1 1 557 600 -3 838
## 5 2013 1 1 558 600 -2 753
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <time>
Example: Get the first row
slice(flights, 1)## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <time>
Example: Get the last row
slice(flights, n())## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 9 30 NA 840 NA NA
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <time>
Note: the function n() returns the number of rows in the dataset.
top_n() selects and orde rs top (or bottom) n entries. If you use it with the entire data frame, it’ll select the top (or bottom) entries from the entirety of the dataframe. If used with group_by, it’ll select the top (or bottom) entries from each group. (more on group_by later in the chapter).Example: select the top 5 flights ordered by dep_delay. i.e: the 5 flights with the highest dep_delay
top_n(flights, 5, dep_delay)## # A tibble: 5 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 1 10 1121 1635 1126 1239
## 3 2013 6 15 1432 1935 1137 1607
## 4 2013 7 22 845 1600 1005 1044
## 5 2013 9 20 1139 1845 1014 1457
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <time>
Example: select the bottom 5 flights ordered by arr_delay. i.e: the 5 flights with the lowest arr_delay
top_n(flights, -5, arr_delay)## # A tibble: 5 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 5 2 1947 1949 -2 2209
## 2 2013 5 4 1816 1820 -4 2017
## 3 2013 5 6 1826 1830 -4 2045
## 4 2013 5 7 1715 1729 -14 1944
## 5 2013 5 20 719 735 -16 951
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <time>
| Logic in R | ?Comparison | ?Logic | |
|---|---|---|---|
| < | Less than | != | Not equal to |
| > | Greater than | %in% | Group membership |
| == | Equal to | is.na | Is NA |
| <= | Less than or equal to | !is.na | Is not NA |
| >= | Greater than or equal to | &, | ,!,xor,any,all |
Example: Find all flights where arr_delay is not NA
filter(flights, !is.na(arr_delay))## # A tibble: 327,346 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 327,336 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <time>
distinct(flights): Remove duplicate rows.
sample_n(flights, 10, replace = TRUE): Randomly select n rows.
sample_frac(flights, 0.5, replace = TRUE): Randomly select fraction of rows.
%>%So far we’ve been passing the dataframe as a first argument to dplyr functions. This is to tell the function that we’re now working inside the dataframe, so that when we use the variable name it knows that it’s a variable of the dataframe.
We can accomplish the same objective using %>%. For example, our very first example: filter(flights, month == 1 & day == 1) can be re-written as: flights %>% filter(month == 1 & day == 1). The %>% operator tells filter that we’re now inside flights.
We will be using this syntax from now on, because it’s much cleaner and easier to read, especially when you combine multiple operations, as we’ll see later in the chapter.
Later (I think it’s better to include it in the basics chapter at the very beginning)
arrange() reorder rows according to the values of a column, or a gorup of columns. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
Example: Arrange the flights by year, break ties by month, and break ties again by day
dplyr::arrange(flights, year, month, day)## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <time>
Note: you don’t see the tie breaking here, because only a small piece of the result if printed. To see the whole result:
break_ties <- arrange(flights, year, month, day)
View(break_ties) # note the letter 'v' in 'View' is capitalizedUse desc() to re-order by a column in descending order
Example: Arrange flights ascendingly by dep_time, and descendingly by dep_delay
arrange(flights, dep_time, desc(dep_delay))## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 4 10 1 1930 271 106
## 2 2013 5 22 1 1935 266 154
## 3 2013 6 24 1 1950 251 105
## 4 2013 7 1 1 2029 212 236
## 5 2013 1 31 1 2100 181 124
## 6 2013 2 11 1 2100 181 111
## 7 2013 3 18 1 2128 153 247
## 8 2013 6 25 1 2130 151 249
## 9 2013 2 24 1 2245 76 121
## 10 2013 1 13 1 2249 72 108
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <time>
flights %>%
summarise(sum((arr_delay + dep_delay) < 0, na.rm=T))## # A tibble: 1 x 1
## sum((arr_delay + dep_delay) < 0, na.r...
## <int>
## 1 188401
Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:
Example: Select the year, month, day from the flights data set
flights %>% select(year, month, day)## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## # ... with 336,772 more rows
Example: Select all columns between year and day (inclusive)
flights %>% select(year:day)## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## # ... with 336,772 more rows
Example: Select all columns except those from year to day (inclusive)
flights %>% select(-(year:day))## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2 830 819 11
## 2 533 529 4 850 830 20
## 3 542 540 2 923 850 33
## 4 544 545 -1 1004 1022 -18
## # ... with 336,772 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>
There are a number of helper functions you can use within select():
starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range("x", 1:3): matches x1, x2 and x3.
See ?select for more details.
To select columns and renaming them: use rename() instead of select()
flights %>% rename(which_year = year)## # A tibble: 336,776 x 19
## which_year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## # ... with 336,772 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <time>
Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.
flights %>% select(time_hour, air_time, everything())## # A tibble: 336,776 x 19
## time_hour air_time year month day dep_time sched_dep_time
## <time> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545
## # ... with 336,772 more rows, and 12 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>
summariseNow we have the ability to filter rows and select columns of interest, we usually want to perform some computations on them by applying summary functions. In general, we have 4 cases:
Example: How many trips have had arr_delay less than zero?
flights %>%
filter(arr_delay < 0) %>%
summarize(n())## # A tibble: 1 x 1
## n()
## <int>
## 1 188933
Notice how the %>% operator comes in handy here. It tells the natural flow of thinking about the problem:
I first want to filter my observations (rows) such that it only contains the flights matching the condition
Then I use summarise() to do computation on the resulted dataframe from the previous step
The computation I need here is just count the number rows; which is achieved by the n() helper function.
Let’s look at a slightly more complicated problem:
Example: calculate the mean, standard deviation, and the number of flights with dep_delay less than 20
# narrate the problem:
# 1- select the rows with less than 20 dep_delay
# 2- do the required computation using summarise
flights %>%
filter(dep_delay < 20) %>%
summarise(dep_delay_mean = mean(dep_delay),
dep_delay_std = sd(dep_delay),
count = n()) ## # A tibble: 1 x 3
## dep_delay_mean dep_delay_std count
## <dbl> <dbl> <int>
## 1 -1.549192 6.415418 265184
Note that dep_delay_mean, dep_delay_std and count are optional names I gave to the columns of the returned table. I could have just typed:
flights %>%
filter(dep_delay < 20) %>%
summarise(mean(dep_delay),
sd(dep_delay),
n()) ## # A tibble: 1 x 3
## mean(dep_delay) sd(dep_delay) n()
## <dbl> <dbl> <int>
## 1 -1.549192 6.415418 265184
In the previous examples, we performed compuations on one variable. To do the same computation on more than one variable simaltaneously, use summarise_each():
Example: compute the variance of dep_delay and arr_delay, for all flights from JFK airport
flights %>%
filter(origin == 'JFK') %>%
summarise_each(funs(var), dep_delay, arr_delay)## # A tibble: 1 x 2
## dep_delay arr_delay
## <dbl> <dbl>
## 1 NA NA
flights %>%
filter(origin == 'JFK') %>%
summarise_each(funs(mean, sd), dep_delay, arr_delay)## # A tibble: 1 x 4
## dep_delay_mean arr_delay_mean dep_delay_sd arr_delay_sd
## <dbl> <dbl> <dbl> <dbl>
## 1 NA NA NaN NaN
Note: dplyr automatically named the output variables for us. If we want to give them custom names:
flights %>%
filter(origin == 'JFK') %>%
summarise_each(funs(mean, sd), dep_delay, arr_delay) %>%
setNames(c('avg_dep_dealy', 'std_dep_delay', 'avg_arr_delay', 'std_arr_delay'))## # A tibble: 1 x 4
## avg_dep_dealy std_dep_delay avg_arr_delay std_arr_delay
## <dbl> <dbl> <dbl> <dbl>
## 1 NA NA NaN NaN
For more illustration: Aggregation with dplyr: summarise and summarise_each
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate() and transmute()
mutate()mutate() always adds new columns at the end of your dataset, so we’ll start by creating a narrower dataset so we can see the new variables.
flights_sml <- flights %>%
select(year:day, ends_with('delay'), distance, air_time)
flights_sml %>%
mutate(gain = arr_delay - dep_delay,
speed = distance / air_time * 60)## # A tibble: 336,776 × 9
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 9 370.0441
## 2 2013 1 1 4 20 1416 227 16 374.2731
## 3 2013 1 1 2 33 1089 160 31 408.3750
## 4 2013 1 1 -1 -18 1576 183 -17 516.7213
## # ... with 336,772 more rows
Note that you can refer to columns that you’ve just created:
flights_sml %>%
mutate(gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)## # A tibble: 336,776 × 10
## year month day dep_delay arr_delay distance air_time gain hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 9 3.783333
## 2 2013 1 1 4 20 1416 227 16 3.783333
## 3 2013 1 1 2 33 1089 160 31 2.666667
## 4 2013 1 1 -1 -18 1576 183 -17 3.050000
## # ... with 336,772 more rows, and 1 more variables: gain_per_hour <dbl>
If you only want to keep the new variables, use:
transmute()flights %>%
transmute(gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)## # A tibble: 336,776 × 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 9 3.783333 2.378855
## 2 16 3.783333 4.229075
## 3 31 2.666667 11.625000
## 4 -17 3.050000 -5.573770
## # ... with 336,772 more rows