This tutorial will cover the following functions in the dplyr
package:
Advanced topic
mutate()
In the first Data Manipulation tutorial we learned how to select just the columns we need by using the select()
function from dplyr
. But in some cases we may want to add a new column. For example, in the chicago_air
data set from the region5air
package, we may want a column that is a one day lag value of the ozone column. To add a new column we use the mutate()
function from the dplyr
package. Again, the first parameter is the data frame. Then we use the lag()
function to create a new column of lagged ozone values.
library(region5air)
library(dplyr)
data(chicago_air)
head(chicago_air)
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
## 2 2013-01-02 0.020 15 0.61 1 4
## 3 2013-01-03 0.021 28 0.17 1 5
## 4 2013-01-04 0.028 18 0.62 1 6
## 5 2013-01-05 0.025 26 0.48 1 7
## 6 2013-01-06 0.026 36 0.47 1 1
d <- mutate(chicago_air, ozone_lag1 = lag(ozone, n = 1))
head(d)
## date ozone temp solar month weekday ozone_lag1
## 1 2013-01-01 0.032 17 0.65 1 3 NA
## 2 2013-01-02 0.020 15 0.61 1 4 0.032
## 3 2013-01-03 0.021 28 0.17 1 5 0.020
## 4 2013-01-04 0.028 18 0.62 1 6 0.021
## 5 2013-01-05 0.025 26 0.48 1 7 0.028
## 6 2013-01-06 0.026 36 0.47 1 1 0.025
We may want to also calculate the difference between the ozone value and a standard value.
d <- mutate(chicago_air, ozone_lag1 = lag(ozone, n = 1), diff_standard = ozone - 0.075)
head(d)
## date ozone temp solar month weekday ozone_lag1 diff_standard
## 1 2013-01-01 0.032 17 0.65 1 3 NA -0.043
## 2 2013-01-02 0.020 15 0.61 1 4 0.032 -0.055
## 3 2013-01-03 0.021 28 0.17 1 5 0.020 -0.054
## 4 2013-01-04 0.028 18 0.62 1 6 0.021 -0.047
## 5 2013-01-05 0.025 26 0.48 1 7 0.028 -0.050
## 6 2013-01-06 0.026 36 0.47 1 1 0.025 -0.049
group_by()
and filter()
So far we have selected columns with select()
, added new columns with mutate()
, and filtered an entire data frame with filter()
.
But sometimes when we filter we don’t want to filter the entire data frame all at once. Suppose we are interested in the high values of ozone in the chicago_air
data frame. We can use the max()
function to find the highest value.
filter(chicago_air, ozone == max(ozone, na.rm = TRUE))
## date ozone temp solar month weekday
## 1 2013-05-14 0.081 74 1.4 5 3
But that’s for the entire year. Suppose you wanted to know the maximum value for each month. You could subset chicago_air
12 times and make a new data frame for each month. Then find the maximum value for each data frame. But that would be very inefficient. What we really want to do is filter the data frame by first grouping all of the rows by month, then filtering on the highest ozone value in each month.
Below we create a data frame named ozone
for illustration. It contains ozone values for one monitor, but for simplicity’s sake it only has two days in June and two days in July.
ozone <- read.table(header=T, text='
day month value unit
1 June 0.040 ppm
2 June 0.050 ppm
1 July 0.045 ppm
2 July 0.052 ppm
')
If we use the filter()
function on ozone
to get the maximum value, we’ll get one row.
filter(ozone, value == max(value))
## day month value unit
## 1 2 July 0.052 ppm
If we wanted to get the maximum value for each month, we would first use the group_by()
function.
ozone <- group_by(ozone, month)
Now the ozone
data frame is grouped by month. If we use the filter()
function now, we’ll get two rows because there are two months.
filter(ozone, value == max(value))
## Source: local data frame [2 x 4]
## Groups: month [2]
##
## day month value unit
## (int) (fctr) (dbl) (fctr)
## 1 2 June 0.050 ppm
## 2 2 July 0.052 ppm
What if we had two columns we wanted to group by? Here is another simple data frame to illustrate this situation.
ozone_8h_1h <- read.table(header=T, text='
day month duration value unit
1 June 1h 0.050 ppm
1 June 8h 0.040 ppm
2 June 1h 0.060 ppm
2 June 8h 0.050 ppm
1 July 1h 0.055 ppm
1 July 8h 0.045 ppm
2 July 1h 0.065 ppm
2 July 8h 0.052 ppm
')
Now we might want to find the maximum value by month and by duration. We simply add duration in the group_by()
function.
ozone_8h_1h <- group_by(ozone_8h_1h, month, duration)
Now the filter()
function will return 4 rows, one for each month/duration pair.
filter(ozone_8h_1h, value == max(value))
## Source: local data frame [4 x 5]
## Groups: month, duration [4]
##
## day month duration value unit
## (int) (fctr) (fctr) (dbl) (fctr)
## 1 2 June 1h 0.060 ppm
## 2 2 June 8h 0.050 ppm
## 3 2 July 1h 0.065 ppm
## 4 2 July 8h 0.052 ppm
group_by()
and summarize()
Filtering by grouped columns is appropriate for situations where we want to subset down to rows that meet a logical criteria. But sometimes we want to apply a summarizing function to grouped values.
Let’s say we want to find the mean for each month and duration in the ozone_8h_1h
data frame. We’ve use the group_by()
function already, ozone_8h_1h <- group_by(ozone_8h_1h, month, value)
. We can’t use filter()
because there would be no value that satisfies the logical expression value == mean(value)
. Instead, we use the summarize()
function.
summarize(ozone_8h_1h, mean = mean(value))
## Source: local data frame [4 x 3]
## Groups: month [?]
##
## month duration mean
## (fctr) (fctr) (dbl)
## 1 July 1h 0.0600
## 2 July 8h 0.0485
## 3 June 1h 0.0550
## 4 June 8h 0.0450
As you can see, the column “units” has been dropped. When you use summarize()
you will only keep the columns that have been grouped-by and the column you create with the summarizing function. All other columns are dropped.
filter()
and when should you use summarise()
?Use filter()
with group_by()
when you want to keep only the rows that meet a logical criteria among groups. You will be using ==
because it’s a logical filter. The output will retain all of the columns.
Use summarize()
with group_by()
when you need a summary of one column for each group. You will use =
because you are creating a new column with a summarizing function. The output will only retain the columns you grouped-by and the one column you created with the summarizing function.
Exercises for this tutorial can be found here: http://rpubs.com/NateByers/DataManip2Exercises.
%>%
dplyr
takes advantage of the pipe operator, %>%
, from the magrittr
package.
The %>%
is a symbol you can use to chain dplyr
functions together. This is convenient because it eliminates the necessity of creating new data frames, or re-assigning values to the same data frame.
Here’s an example of a data manipulation process.
data(airdata)
# filter down to ozone
ozone <- filter(airdata, parameter == "44201")
# select only certain columns
ozone <- select(ozone, site, datetime, duration, value, unit, poc)
# group by certain columns
ozone <- group_by(ozone, site, duration)
# find mean value for each group
ozone <- summarize(ozone, mean = mean(value, na.rm = FALSE))
head(ozone)
## Source: local data frame [6 x 3]
## Groups: site [6]
##
## site duration mean
## (chr) (int) (dbl)
## 1 840170310001 60 0.02534631
## 2 840170310032 60 0.03259597
## 3 840170310064 60 0.02392692
## 4 840170310076 60 0.02375103
## 5 840170311003 60 0.02771658
## 6 840170311601 60 0.02588920
With %>%
you can start with the data frame and then write each dplyr
function without supplying the data frame.
ozone <- airdata %>%
filter(parameter == "44201") %>%
select(site, datetime, duration, value, unit, poc) %>%
group_by(site, duration) %>%
summarize(mean = mean(value, na.rm = FALSE))
head(ozone)
## Source: local data frame [6 x 3]
## Groups: site [6]
##
## site duration mean
## (chr) (int) (dbl)
## 1 840170310001 60 0.02534631
## 2 840170310032 60 0.03259597
## 3 840170310064 60 0.02392692
## 4 840170310076 60 0.02375103
## 5 840170311003 60 0.02771658
## 6 840170311601 60 0.02588920