The Idea

The dplyr package is a package developed by Hadley Wickham. It is widely trusted and is updated regularly. Thus, it is an excellent tool for your data science tool box. dplyr is used to help solve data manipulation challenges and because it was built for the sole purpose of simplifying the process of manipulating, sorting, summarizing and joining data frames, code created in this way tends to flow together and be less choppy than if one cobbled together other R functions.

Fundamental tools in dplyr include:

Tool Function
mutate() creates new variables from old ones.
select() selects variables based on their names.
filter() provides basic filtering capabilities.
summarise() summarizes the data by functions of your choice.
arrange() changes the ordering of the rows.
join() joins separate dataframes.
group_by() groups data by categorical level.

A good reference for this material is the data transformation chapter in R for data science. A cheat sheet is also available.

library(tidyverse)
library(dplyr)

We will be using this dataset. Download and import the data.

surveys <- read_csv("https://raw.githubusercontent.com/deepanshu88/data/master/sampledata.csv")
head(surveys)
## # A tibble: 6 x 16
##   Index State  Y2002  Y2003  Y2004  Y2005  Y2006  Y2007  Y2008  Y2009  Y2010
##   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 A     Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 2 A     Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 3 A     Ariz~ 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6
## 4 A     Arka~ 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6
## 5 C     Cali~ 1.69e6 1.68e6 1.89e6 1.48e6 1.74e6 1.81e6 1.49e6 1.66e6 1.62e6
## 6 C     Colo~ 1.34e6 1.88e6 1.89e6 1.24e6 1.87e6 1.81e6 1.88e6 1.75e6 1.91e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## #   Y2014 <dbl>, Y2015 <dbl>

The Pipe %>%

While the pipe operator is not technically a requirement for the tidyverse or dplyr, it certainly increases the ease with which we make use of the tools.

At a basic level, the pipe forwards a value or the result of an expression into the next function. We can then omit the data call in the function. For example: \[\text{filter(data, variable == value)}\] can also be written as \[\text{data %>% filter(variable == value)}.\] While that does not seem to be a substantial change or even an upgrade at all, more complex strings of code will become increasingly simple with this tool. More on this later.

select()

In this example, we select the survey data from 2003 only. We want our final data set to have the state name with the 2003 data. So, we just list the 2 column names we want to keep.

new_dat <- surveys %>%
  select(State, Y2003)
head(new_dat,5)
## # A tibble: 5 x 2
##   State        Y2003
##   <chr>        <dbl>
## 1 Alabama    1317711
## 2 Alaska     1960378
## 3 Arizona    1968140
## 4 Arkansas   1994927
## 5 California 1675807

We can also use select() to drop variables. Here we drop 2003 to 2005 data.

new_dat <- surveys %>%
  select(-c(Y2003:Y2005))
head(new_dat,5)
## # A tibble: 5 x 13
##   Index State  Y2002  Y2006  Y2007  Y2008  Y2009  Y2010  Y2011  Y2012  Y2013
##   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 A     Alab~ 1.30e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6 1.44e6 1.19e6 1.85e6
## 2 A     Alas~ 1.17e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6 1.23e6 1.51e6 1.99e6
## 3 A     Ariz~ 1.74e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6 1.13e6 1.91e6 1.36e6
## 4 A     Arka~ 1.49e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6 1.93e6 1.22e6 1.59e6
## 5 C     Cali~ 1.69e6 1.74e6 1.81e6 1.49e6 1.66e6 1.62e6 1.64e6 1.92e6 1.16e6
## # ... with 2 more variables: Y2014 <dbl>, Y2015 <dbl>

A number of features also help select based on variable names.

Helpers Description
starts_with() Starts with a prefix
ends_with() Ends with a prefix
contains() Contains a literal string
matches() Matches a regular expression
num_range() Numerical range like x01, x02, x03.
one_of() Variables in character vector.
everything() All variables.

If we wanted to keep the year columns, we could specify either starts_with(“Y”) or contains(“20”).

new_dat <- surveys %>%
  select(contains("20"))
head(new_dat,5)
## # A tibble: 5 x 14
##    Y2002  Y2003  Y2004  Y2005  Y2006  Y2007  Y2008  Y2009  Y2010  Y2011  Y2012
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6 1.44e6 1.19e6
## 2 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6 1.23e6 1.51e6
## 3 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6 1.13e6 1.91e6
## 4 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6 1.93e6 1.22e6
## 5 1.69e6 1.68e6 1.89e6 1.48e6 1.74e6 1.81e6 1.49e6 1.66e6 1.62e6 1.64e6 1.92e6
## # ... with 3 more variables: Y2013 <dbl>, Y2014 <dbl>, Y2015 <dbl>

filter()

In order to specify a subset of the data, we use the filter() function. In this example, we filter the survey data with Index “A”.

new_dat <- surveys %>%
  filter(Index=="A")
head(new_dat,5)
## # A tibble: 4 x 16
##   Index State  Y2002  Y2003  Y2004  Y2005  Y2006  Y2007  Y2008  Y2009  Y2010
##   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 A     Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 2 A     Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 3 A     Ariz~ 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6
## 4 A     Arka~ 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## #   Y2014 <dbl>, Y2015 <dbl>

Logical connectives have the obvious interpretation. The connectives and (“&”) and or (“|”) can also be used. The exclamation point is used as the symbol for not (“!”).

Symbol Meaning Symbol Meaning
\(<\) Less than \(!=\) Not equal to
\(>\) Greater than \(\%in\%\) Group membership
\(==\) Equal to \(is.na\) is NA
\(<=\) Less than or equal to \(!is.na\) is not NA
\(>=\) Greater than or equal to \(\&,|,!\) Boolean operators (and, or, not)

In this example, we filter out data with Index “A” or “C”.

new_dat <- surveys %>%
  filter(Index %in% c("A","C"))
head(new_dat,5)
## # A tibble: 5 x 16
##   Index State  Y2002  Y2003  Y2004  Y2005  Y2006  Y2007  Y2008  Y2009  Y2010
##   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 A     Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 2 A     Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 3 A     Ariz~ 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6
## 4 A     Arka~ 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6
## 5 C     Cali~ 1.69e6 1.68e6 1.89e6 1.48e6 1.74e6 1.81e6 1.49e6 1.66e6 1.62e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## #   Y2014 <dbl>, Y2015 <dbl>

Next, we want survey data with 2003 data between 1300000 and 1500000.

new_dat <- surveys %>%
  filter(Y2003 > 1300000, Y2003 < 1500000)
head(new_dat,5)
## # A tibble: 5 x 16
##   Index State  Y2002  Y2003  Y2004  Y2005  Y2006  Y2007  Y2008  Y2009  Y2010
##   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 A     Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 2 F     Flor~ 1.96e6 1.47e6 1.42e6 1.36e6 1.34e6 1.28e6 1.76e6 1.82e6 1.20e6
## 3 I     Idaho 1.35e6 1.44e6 1.74e6 1.54e6 1.12e6 1.77e6 1.34e6 1.75e6 1.44e6
## 4 I     Iowa  1.50e6 1.44e6 1.58e6 1.39e6 1.55e6 1.45e6 1.32e6 1.15e6 1.75e6
## 5 K     Kent~ 1.81e6 1.45e6 1.80e6 1.25e6 1.14e6 1.91e6 1.30e6 1.96e6 1.35e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## #   Y2014 <dbl>, Y2015 <dbl>

If we are looking for data that contains a string, we can filter using the grepl command. In this example, we filter any data value in the 2002 column that contains the string “11”.

new_dat <- surveys %>%
  filter(grepl("11", Y2002))
head(new_dat,5)
## # A tibble: 5 x 16
##   Index State  Y2002  Y2003  Y2004  Y2005  Y2006  Y2007  Y2008  Y2009  Y2010
##   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 A     Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 2 D     Dist~ 1.11e6 1.99e6 1.37e6 1.83e6 1.80e6 1.60e6 1.19e6 1.74e6 1.71e6
## 3 N     Neva~ 1.43e6 1.11e6 1.12e6 1.76e6 1.69e6 1.77e6 1.90e6 1.23e6 1.53e6
## 4 O     Okla~ 1.17e6 1.33e6 1.66e6 1.80e6 1.31e6 1.71e6 1.26e6 1.85e6 1.67e6
## 5 S     Sout~ 1.16e6 1.15e6 1.66e6 1.42e6 1.42e6 1.28e6 1.17e6 1.85e6 1.55e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## #   Y2014 <dbl>, Y2015 <dbl>

Just to verify the point, here is the resulting 2002 column and associated state name only. Each entry in the 2002 column has consecutive 1’s.

head(new_dat,5)[,c(2,3)]
## # A tibble: 5 x 2
##   State                  Y2002
##   <chr>                  <dbl>
## 1 Alaska               1170302
## 2 District of Columbia 1111437
## 3 Nevada               1426117
## 4 Oklahoma             1173918
## 5 South Dakota         1159037

summarise()

The summarise() function helps us create a summary statistic for a particular variable. For example, we find the mean and median for 2002.

new_dat <- surveys %>%
  summarise(Y2002_Mean = mean(Y2002), Y2002_Median = median(Y2002))
new_dat
## # A tibble: 1 x 2
##   Y2002_Mean Y2002_Median
##        <dbl>        <dbl>
## 1   1566034.      1584734

If we want to summarize multiple years, we use the vars command along with a list of functions. Note that we are using the summarise_at function rather than summarise.

new_dat <- surveys %>%
  summarise_at(vars(Y2002:Y2007), list(Mean = mean, Median = median))
head(new_dat,5)
## # A tibble: 1 x 12
##   Y2002_Mean Y2003_Mean Y2004_Mean Y2005_Mean Y2006_Mean Y2007_Mean Y2002_Median
##        <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>        <dbl>
## 1   1566034.   1509193.   1540555.   1522064.   1530969.   1553219.      1584734
## # ... with 5 more variables: Y2003_Median <dbl>, Y2004_Median <dbl>,
## #   Y2005_Median <dbl>, Y2006_Median <dbl>, Y2007_Median <dbl>

For non-standard functions, we can create a function inside the summarise_at function. For example, if I wanted to display the data from 2002 and 2003 in terms of millions (ie, 1566034 displayed as 1.566034).

new_dat <- surveys %>%
  summarise_at(vars(Y2002,Y2003), function(x) (x/1000000))
head(new_dat,5)
## # A tibble: 5 x 2
##   Y2002 Y2003
##   <dbl> <dbl>
## 1  1.30  1.32
## 2  1.17  1.96
## 3  1.74  1.97
## 4  1.49  1.99
## 5  1.69  1.68

arrange()

The arrange function allows us to sort data. For example, in descending (desc) or ascending (default) order of one of the variables. Here, we arrange the data in descending order by the 2002 column.

new_dat <- surveys %>%
  arrange(desc(Y2002))
head(new_dat,5)
## # A tibble: 5 x 16
##   Index State  Y2002  Y2003  Y2004  Y2005  Y2006  Y2007  Y2008  Y2009  Y2010
##   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 M     Miss~ 1.98e6 1.29e6 1.63e6 1.94e6 1.35e6 1.73e6 1.43e6 1.57e6 1.38e6
## 2 W     Wash~ 1.98e6 1.69e6 1.20e6 1.16e6 1.33e6 1.62e6 1.55e6 1.56e6 1.18e6
## 3 F     Flor~ 1.96e6 1.47e6 1.42e6 1.36e6 1.34e6 1.28e6 1.76e6 1.82e6 1.20e6
## 4 G     Geor~ 1.93e6 1.54e6 1.81e6 1.78e6 1.33e6 1.22e6 1.77e6 1.63e6 1.15e6
## 5 N     Nebr~ 1.89e6 1.31e6 1.43e6 1.24e6 1.50e6 1.28e6 1.14e6 1.27e6 1.13e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## #   Y2014 <dbl>, Y2015 <dbl>

We can sort by multiple variables by specifying them in order. Here, we first sort by Index and then any ties are broken by the 2002 data.

new_dat <- surveys %>%
  arrange(Index, Y2002)
head(new_dat,5)
## # A tibble: 5 x 16
##   Index State  Y2002  Y2003  Y2004  Y2005  Y2006  Y2007  Y2008  Y2009  Y2010
##   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 A     Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 2 A     Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 3 A     Arka~ 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6
## 4 A     Ariz~ 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6
## 5 C     Colo~ 1.34e6 1.88e6 1.89e6 1.24e6 1.87e6 1.81e6 1.88e6 1.75e6 1.91e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## #   Y2014 <dbl>, Y2015 <dbl>

group_by()

The group by function allows for groupings by a categorical variable. It is a silent variable in that applying it to the data by itself does not provide any observable manipulation. Rather, the change you see is when we apply the function in conjunction with other functions. In this example, we find the mean and median of the data from 2003 – 2005 grouped by Index.

new_dat <- surveys %>%
  group_by(Index) %>%
  summarise_at(vars(Y2003:Y2005), list(mean=mean, median=median))
head(new_dat,5)
## # A tibble: 5 x 7
##   Index Y2003_mean Y2004_mean Y2005_mean Y2003_median Y2004_median Y2005_median
##   <chr>      <dbl>      <dbl>      <dbl>        <dbl>        <dbl>        <dbl>
## 1 A        1810289   1358400.   1667653.      1964259      1248441      1637391
## 2 C        1595708   1652556    1411970       1675807      1886149      1480280
## 3 D        1631207   1540697    1615854       1631207      1540697      1615854
## 4 F        1468852   1419738    1362787       1468852      1419738      1362787
## 5 G        1541565   1810773    1779091       1541565      1810773      1779091

You will note that we used multiple pipes in the preceding example. First, the surveys data in grouped by the variable Index. Then, the grouped data is sent to the summarise_at function.

mutate()

In order to create a new variable, we use the mutate function. For example, suppose we want to find the change in values from 2002 to 2015. We will display only the Index, State, Y2002, Y2015 and absolute change in our final table.

new_dat <- surveys %>%
  mutate(change = Y2015 - Y2002) %>%
  select(Index, State, Y2002, Y2015, change)
head(new_dat,5)
## # A tibble: 5 x 5
##   Index State        Y2002   Y2015  change
##   <chr> <chr>        <dbl>   <dbl>   <dbl>
## 1 A     Alabama    1296530 1916661  620131
## 2 A     Alaska     1170302 1979143  808841
## 3 A     Arizona    1742027 1647724  -94303
## 4 A     Arkansas   1485531 1329341 -156190
## 5 C     California 1685349 1644607  -40742

join()

The join function is a way to join two data sets.

Join Function Description
inner_join(x, y, by = ) Include only rows in both x and y that have a matching value
left_join(x, y, by = ) Include all of x, and matching rows of y
right_join(x, y, by = ) Include all of y, and matching rows of x
full_join(x, y, by = ) Include all rows in both x and y
semi_join(x, y, by = ) Include rows of x that match y but only keep the columns from x
anti_join(x, y, by = ) Opposite of semi_join

We invent a data set to illustrate the join function. Below, we define 2 data frames.

df1=data.frame(ID = c(1, 2, 3, 4, 5),
               w=c('a', 'b', 'c', 'd', 'e'),
               x=c(1, 1, 0, 0, 1),
               y=rnorm(5),
               z=letters[1:5]
               )
df1
##   ID w x           y z
## 1  1 a 1  0.72875246 a
## 2  2 b 1 -1.14203906 b
## 3  3 c 0 -0.09441652 c
## 4  4 d 0  0.61865242 d
## 5  5 e 1  0.24691825 e
df2=data.frame(ID = c(1, 7, 3, 6, 8),
               a=c('z', 'b', 'k', 'd', 'l'),
               b=c(1, 2, 3, 0, 4),
               c=rnorm(5),
               d=letters[2:6]
               )
df2
##   ID a b          c d
## 1  1 z 1 0.28661997 b
## 2  7 b 2 2.09788075 c
## 3  3 k 3 0.52568970 d
## 4  6 d 0 0.09773865 e
## 5  8 l 4 0.08717624 f

We first begin joining common rows of a table (inner_join). In this case, row ID 1 and 3 are common and so will be joined. If the primary key (the by = "" piece) is not the same, try by = c(“ID1”, “ID2”).

inner_join(df1, df2, by = "ID")
##   ID w x           y z a b         c d
## 1  1 a 1  0.72875246 a z 1 0.2866200 b
## 2  3 c 0 -0.09441652 c k 3 0.5256897 d

A left join will maintain all of the rows from the left table (based on the ID column). It will add values from the right table and encode NA’s where no match exists.

left_join(df1, df2, by = "ID")
##   ID w x           y z    a  b         c    d
## 1  1 a 1  0.72875246 a    z  1 0.2866200    b
## 2  2 b 1 -1.14203906 b <NA> NA        NA <NA>
## 3  3 c 0 -0.09441652 c    k  3 0.5256897    d
## 4  4 d 0  0.61865242 d <NA> NA        NA <NA>
## 5  5 e 1  0.24691825 e <NA> NA        NA <NA>

Here we get the opposite (use the right ID values), but note the order of columns.

right_join(df1, df2, by = "ID")
##   ID    w  x           y    z a b          c d
## 1  1    a  1  0.72875246    a z 1 0.28661997 b
## 2  3    c  0 -0.09441652    c k 3 0.52568970 d
## 3  7 <NA> NA          NA <NA> b 2 2.09788075 c
## 4  6 <NA> NA          NA <NA> d 0 0.09773865 e
## 5  8 <NA> NA          NA <NA> l 4 0.08717624 f

We can join two tables fully using the full_join command.

full_join(df1, df2, by = "ID")
##   ID    w  x           y    z    a  b          c    d
## 1  1    a  1  0.72875246    a    z  1 0.28661997    b
## 2  2    b  1 -1.14203906    b <NA> NA         NA <NA>
## 3  3    c  0 -0.09441652    c    k  3 0.52568970    d
## 4  4    d  0  0.61865242    d <NA> NA         NA <NA>
## 5  5    e  1  0.24691825    e <NA> NA         NA <NA>
## 6  7 <NA> NA          NA <NA>    b  2 2.09788075    c
## 7  6 <NA> NA          NA <NA>    d  0 0.09773865    e
## 8  8 <NA> NA          NA <NA>    l  4 0.08717624    f

Let’s keep the ID values of df1 but the columns of df2.

semi_join(df1, df2, by="ID")
##   ID w x           y z
## 1  1 a 1  0.72875246 a
## 2  3 c 0 -0.09441652 c

Citations

Bhalla, Deepanshu. 2020. DPLYR Tutorial: Data Manipulation (50 Examples). https://www.listendata.com/2016/08/dplyr-tutorial.html#join-function

Boehmke, B. Apply Family. 2016. http://uc-r.github.io/apply_family

Boehmke, B., & Greenwell, B.M. 2019. Hands-On Machine Learning with R.

Muller, Kirill. 2021. dplyr. https://dplyr.tidyverse.org/