The dplyr package is a package developed by Hadley Wickham. It is widely trusted and is updated regularly. Thus, it is an excellent tool for your data science tool box. dplyr is used to help solve data manipulation challenges and because it was built for the sole purpose of simplifying the process of manipulating, sorting, summarizing and joining data frames, code created in this way tends to flow together and be less choppy than if one cobbled together other R functions.
Fundamental tools in dplyr include:
| Tool | Function |
|---|---|
| mutate() | creates new variables from old ones. |
| select() | selects variables based on their names. |
| filter() | provides basic filtering capabilities. |
| summarise() | summarizes the data by functions of your choice. |
| arrange() | changes the ordering of the rows. |
| join() | joins separate dataframes. |
| group_by() | groups data by categorical level. |
A good reference for this material is the data transformation chapter in R for data science. A cheat sheet is also available.
library(tidyverse)
library(dplyr)
We will be using this dataset. Download and import the data.
surveys <- read_csv("https://raw.githubusercontent.com/deepanshu88/data/master/sampledata.csv")
head(surveys)
## # A tibble: 6 x 16
## Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 2 A Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 3 A Ariz~ 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6
## 4 A Arka~ 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6
## 5 C Cali~ 1.69e6 1.68e6 1.89e6 1.48e6 1.74e6 1.81e6 1.49e6 1.66e6 1.62e6
## 6 C Colo~ 1.34e6 1.88e6 1.89e6 1.24e6 1.87e6 1.81e6 1.88e6 1.75e6 1.91e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## # Y2014 <dbl>, Y2015 <dbl>
While the pipe operator is not technically a requirement for the tidyverse or dplyr, it certainly increases the ease with which we make use of the tools.
At a basic level, the pipe forwards a value or the result of an expression into the next function. We can then omit the data call in the function. For example: \[\text{filter(data, variable == value)}\] can also be written as \[\text{data %>% filter(variable == value)}.\] While that does not seem to be a substantial change or even an upgrade at all, more complex strings of code will become increasingly simple with this tool. More on this later.
In this example, we select the survey data from 2003 only. We want our final data set to have the state name with the 2003 data. So, we just list the 2 column names we want to keep.
new_dat <- surveys %>%
select(State, Y2003)
head(new_dat,5)
## # A tibble: 5 x 2
## State Y2003
## <chr> <dbl>
## 1 Alabama 1317711
## 2 Alaska 1960378
## 3 Arizona 1968140
## 4 Arkansas 1994927
## 5 California 1675807
We can also use select() to drop variables. Here we drop 2003 to 2005 data.
new_dat <- surveys %>%
select(-c(Y2003:Y2005))
head(new_dat,5)
## # A tibble: 5 x 13
## Index State Y2002 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A Alab~ 1.30e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6 1.44e6 1.19e6 1.85e6
## 2 A Alas~ 1.17e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6 1.23e6 1.51e6 1.99e6
## 3 A Ariz~ 1.74e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6 1.13e6 1.91e6 1.36e6
## 4 A Arka~ 1.49e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6 1.93e6 1.22e6 1.59e6
## 5 C Cali~ 1.69e6 1.74e6 1.81e6 1.49e6 1.66e6 1.62e6 1.64e6 1.92e6 1.16e6
## # ... with 2 more variables: Y2014 <dbl>, Y2015 <dbl>
A number of features also help select based on variable names.
| Helpers | Description |
|---|---|
| starts_with() | Starts with a prefix |
| ends_with() | Ends with a prefix |
| contains() | Contains a literal string |
| matches() | Matches a regular expression |
| num_range() | Numerical range like x01, x02, x03. |
| one_of() | Variables in character vector. |
| everything() | All variables. |
If we wanted to keep the year columns, we could specify either starts_with(“Y”) or contains(“20”).
new_dat <- surveys %>%
select(contains("20"))
head(new_dat,5)
## # A tibble: 5 x 14
## Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6 1.44e6 1.19e6
## 2 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6 1.23e6 1.51e6
## 3 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6 1.13e6 1.91e6
## 4 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6 1.93e6 1.22e6
## 5 1.69e6 1.68e6 1.89e6 1.48e6 1.74e6 1.81e6 1.49e6 1.66e6 1.62e6 1.64e6 1.92e6
## # ... with 3 more variables: Y2013 <dbl>, Y2014 <dbl>, Y2015 <dbl>
In order to specify a subset of the data, we use the filter() function. In this example, we filter the survey data with Index “A”.
new_dat <- surveys %>%
filter(Index=="A")
head(new_dat,5)
## # A tibble: 4 x 16
## Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 2 A Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 3 A Ariz~ 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6
## 4 A Arka~ 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## # Y2014 <dbl>, Y2015 <dbl>
Logical connectives have the obvious interpretation. The connectives and (“&”) and or (“|”) can also be used. The exclamation point is used as the symbol for not (“!”).
| Symbol | Meaning | Symbol | Meaning |
|---|---|---|---|
| \(<\) | Less than | \(!=\) | Not equal to |
| \(>\) | Greater than | \(\%in\%\) | Group membership |
| \(==\) | Equal to | \(is.na\) | is NA |
| \(<=\) | Less than or equal to | \(!is.na\) | is not NA |
| \(>=\) | Greater than or equal to | \(\&,|,!\) | Boolean operators (and, or, not) |
In this example, we filter out data with Index “A” or “C”.
new_dat <- surveys %>%
filter(Index %in% c("A","C"))
head(new_dat,5)
## # A tibble: 5 x 16
## Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 2 A Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 3 A Ariz~ 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6
## 4 A Arka~ 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6
## 5 C Cali~ 1.69e6 1.68e6 1.89e6 1.48e6 1.74e6 1.81e6 1.49e6 1.66e6 1.62e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## # Y2014 <dbl>, Y2015 <dbl>
Next, we want survey data with 2003 data between 1300000 and 1500000.
new_dat <- surveys %>%
filter(Y2003 > 1300000, Y2003 < 1500000)
head(new_dat,5)
## # A tibble: 5 x 16
## Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 2 F Flor~ 1.96e6 1.47e6 1.42e6 1.36e6 1.34e6 1.28e6 1.76e6 1.82e6 1.20e6
## 3 I Idaho 1.35e6 1.44e6 1.74e6 1.54e6 1.12e6 1.77e6 1.34e6 1.75e6 1.44e6
## 4 I Iowa 1.50e6 1.44e6 1.58e6 1.39e6 1.55e6 1.45e6 1.32e6 1.15e6 1.75e6
## 5 K Kent~ 1.81e6 1.45e6 1.80e6 1.25e6 1.14e6 1.91e6 1.30e6 1.96e6 1.35e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## # Y2014 <dbl>, Y2015 <dbl>
If we are looking for data that contains a string, we can filter using the grepl command. In this example, we filter any data value in the 2002 column that contains the string “11”.
new_dat <- surveys %>%
filter(grepl("11", Y2002))
head(new_dat,5)
## # A tibble: 5 x 16
## Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 2 D Dist~ 1.11e6 1.99e6 1.37e6 1.83e6 1.80e6 1.60e6 1.19e6 1.74e6 1.71e6
## 3 N Neva~ 1.43e6 1.11e6 1.12e6 1.76e6 1.69e6 1.77e6 1.90e6 1.23e6 1.53e6
## 4 O Okla~ 1.17e6 1.33e6 1.66e6 1.80e6 1.31e6 1.71e6 1.26e6 1.85e6 1.67e6
## 5 S Sout~ 1.16e6 1.15e6 1.66e6 1.42e6 1.42e6 1.28e6 1.17e6 1.85e6 1.55e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## # Y2014 <dbl>, Y2015 <dbl>
Just to verify the point, here is the resulting 2002 column and associated state name only. Each entry in the 2002 column has consecutive 1’s.
head(new_dat,5)[,c(2,3)]
## # A tibble: 5 x 2
## State Y2002
## <chr> <dbl>
## 1 Alaska 1170302
## 2 District of Columbia 1111437
## 3 Nevada 1426117
## 4 Oklahoma 1173918
## 5 South Dakota 1159037
The summarise() function helps us create a summary statistic for a particular variable. For example, we find the mean and median for 2002.
new_dat <- surveys %>%
summarise(Y2002_Mean = mean(Y2002), Y2002_Median = median(Y2002))
new_dat
## # A tibble: 1 x 2
## Y2002_Mean Y2002_Median
## <dbl> <dbl>
## 1 1566034. 1584734
If we want to summarize multiple years, we use the vars command along with a list of functions. Note that we are using the summarise_at function rather than summarise.
new_dat <- surveys %>%
summarise_at(vars(Y2002:Y2007), list(Mean = mean, Median = median))
head(new_dat,5)
## # A tibble: 1 x 12
## Y2002_Mean Y2003_Mean Y2004_Mean Y2005_Mean Y2006_Mean Y2007_Mean Y2002_Median
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1566034. 1509193. 1540555. 1522064. 1530969. 1553219. 1584734
## # ... with 5 more variables: Y2003_Median <dbl>, Y2004_Median <dbl>,
## # Y2005_Median <dbl>, Y2006_Median <dbl>, Y2007_Median <dbl>
For non-standard functions, we can create a function inside the summarise_at function. For example, if I wanted to display the data from 2002 and 2003 in terms of millions (ie, 1566034 displayed as 1.566034).
new_dat <- surveys %>%
summarise_at(vars(Y2002,Y2003), function(x) (x/1000000))
head(new_dat,5)
## # A tibble: 5 x 2
## Y2002 Y2003
## <dbl> <dbl>
## 1 1.30 1.32
## 2 1.17 1.96
## 3 1.74 1.97
## 4 1.49 1.99
## 5 1.69 1.68
The arrange function allows us to sort data. For example, in descending (desc) or ascending (default) order of one of the variables. Here, we arrange the data in descending order by the 2002 column.
new_dat <- surveys %>%
arrange(desc(Y2002))
head(new_dat,5)
## # A tibble: 5 x 16
## Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M Miss~ 1.98e6 1.29e6 1.63e6 1.94e6 1.35e6 1.73e6 1.43e6 1.57e6 1.38e6
## 2 W Wash~ 1.98e6 1.69e6 1.20e6 1.16e6 1.33e6 1.62e6 1.55e6 1.56e6 1.18e6
## 3 F Flor~ 1.96e6 1.47e6 1.42e6 1.36e6 1.34e6 1.28e6 1.76e6 1.82e6 1.20e6
## 4 G Geor~ 1.93e6 1.54e6 1.81e6 1.78e6 1.33e6 1.22e6 1.77e6 1.63e6 1.15e6
## 5 N Nebr~ 1.89e6 1.31e6 1.43e6 1.24e6 1.50e6 1.28e6 1.14e6 1.27e6 1.13e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## # Y2014 <dbl>, Y2015 <dbl>
We can sort by multiple variables by specifying them in order. Here, we first sort by Index and then any ties are broken by the 2002 data.
new_dat <- surveys %>%
arrange(Index, Y2002)
head(new_dat,5)
## # A tibble: 5 x 16
## Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A Alas~ 1.17e6 1.96e6 1.82e6 1.45e6 1.86e6 1.47e6 1.55e6 1.44e6 1.63e6
## 2 A Alab~ 1.30e6 1.32e6 1.12e6 1.49e6 1.11e6 1.44e6 1.95e6 1.94e6 1.24e6
## 3 A Arka~ 1.49e6 1.99e6 1.12e6 1.95e6 1.67e6 1.80e6 1.19e6 1.63e6 1.67e6
## 4 A Ariz~ 1.74e6 1.97e6 1.38e6 1.78e6 1.10e6 1.11e6 1.75e6 1.55e6 1.30e6
## 5 C Colo~ 1.34e6 1.88e6 1.89e6 1.24e6 1.87e6 1.81e6 1.88e6 1.75e6 1.91e6
## # ... with 5 more variables: Y2011 <dbl>, Y2012 <dbl>, Y2013 <dbl>,
## # Y2014 <dbl>, Y2015 <dbl>
The group by function allows for groupings by a categorical variable. It is a silent variable in that applying it to the data by itself does not provide any observable manipulation. Rather, the change you see is when we apply the function in conjunction with other functions. In this example, we find the mean and median of the data from 2003 – 2005 grouped by Index.
new_dat <- surveys %>%
group_by(Index) %>%
summarise_at(vars(Y2003:Y2005), list(mean=mean, median=median))
head(new_dat,5)
## # A tibble: 5 x 7
## Index Y2003_mean Y2004_mean Y2005_mean Y2003_median Y2004_median Y2005_median
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 1810289 1358400. 1667653. 1964259 1248441 1637391
## 2 C 1595708 1652556 1411970 1675807 1886149 1480280
## 3 D 1631207 1540697 1615854 1631207 1540697 1615854
## 4 F 1468852 1419738 1362787 1468852 1419738 1362787
## 5 G 1541565 1810773 1779091 1541565 1810773 1779091
You will note that we used multiple pipes in the preceding example. First, the surveys data in grouped by the variable Index. Then, the grouped data is sent to the summarise_at function.
In order to create a new variable, we use the mutate function. For example, suppose we want to find the change in values from 2002 to 2015. We will display only the Index, State, Y2002, Y2015 and absolute change in our final table.
new_dat <- surveys %>%
mutate(change = Y2015 - Y2002) %>%
select(Index, State, Y2002, Y2015, change)
head(new_dat,5)
## # A tibble: 5 x 5
## Index State Y2002 Y2015 change
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 A Alabama 1296530 1916661 620131
## 2 A Alaska 1170302 1979143 808841
## 3 A Arizona 1742027 1647724 -94303
## 4 A Arkansas 1485531 1329341 -156190
## 5 C California 1685349 1644607 -40742
The join function is a way to join two data sets.
| Join Function | Description |
|---|---|
| inner_join(x, y, by = ) | Include only rows in both x and y that have a matching value |
| left_join(x, y, by = ) | Include all of x, and matching rows of y |
| right_join(x, y, by = ) | Include all of y, and matching rows of x |
| full_join(x, y, by = ) | Include all rows in both x and y |
| semi_join(x, y, by = ) | Include rows of x that match y but only keep the columns from x |
| anti_join(x, y, by = ) | Opposite of semi_join |
We invent a data set to illustrate the join function. Below, we define 2 data frames.
df1=data.frame(ID = c(1, 2, 3, 4, 5),
w=c('a', 'b', 'c', 'd', 'e'),
x=c(1, 1, 0, 0, 1),
y=rnorm(5),
z=letters[1:5]
)
df1
## ID w x y z
## 1 1 a 1 0.72875246 a
## 2 2 b 1 -1.14203906 b
## 3 3 c 0 -0.09441652 c
## 4 4 d 0 0.61865242 d
## 5 5 e 1 0.24691825 e
df2=data.frame(ID = c(1, 7, 3, 6, 8),
a=c('z', 'b', 'k', 'd', 'l'),
b=c(1, 2, 3, 0, 4),
c=rnorm(5),
d=letters[2:6]
)
df2
## ID a b c d
## 1 1 z 1 0.28661997 b
## 2 7 b 2 2.09788075 c
## 3 3 k 3 0.52568970 d
## 4 6 d 0 0.09773865 e
## 5 8 l 4 0.08717624 f
We first begin joining common rows of a table (inner_join). In this case, row ID 1 and 3 are common and so will be joined. If the primary key (the by = "" piece) is not the same, try by = c(“ID1”, “ID2”).
inner_join(df1, df2, by = "ID")
## ID w x y z a b c d
## 1 1 a 1 0.72875246 a z 1 0.2866200 b
## 2 3 c 0 -0.09441652 c k 3 0.5256897 d
A left join will maintain all of the rows from the left table (based on the ID column). It will add values from the right table and encode NA’s where no match exists.
left_join(df1, df2, by = "ID")
## ID w x y z a b c d
## 1 1 a 1 0.72875246 a z 1 0.2866200 b
## 2 2 b 1 -1.14203906 b <NA> NA NA <NA>
## 3 3 c 0 -0.09441652 c k 3 0.5256897 d
## 4 4 d 0 0.61865242 d <NA> NA NA <NA>
## 5 5 e 1 0.24691825 e <NA> NA NA <NA>
Here we get the opposite (use the right ID values), but note the order of columns.
right_join(df1, df2, by = "ID")
## ID w x y z a b c d
## 1 1 a 1 0.72875246 a z 1 0.28661997 b
## 2 3 c 0 -0.09441652 c k 3 0.52568970 d
## 3 7 <NA> NA NA <NA> b 2 2.09788075 c
## 4 6 <NA> NA NA <NA> d 0 0.09773865 e
## 5 8 <NA> NA NA <NA> l 4 0.08717624 f
We can join two tables fully using the full_join command.
full_join(df1, df2, by = "ID")
## ID w x y z a b c d
## 1 1 a 1 0.72875246 a z 1 0.28661997 b
## 2 2 b 1 -1.14203906 b <NA> NA NA <NA>
## 3 3 c 0 -0.09441652 c k 3 0.52568970 d
## 4 4 d 0 0.61865242 d <NA> NA NA <NA>
## 5 5 e 1 0.24691825 e <NA> NA NA <NA>
## 6 7 <NA> NA NA <NA> b 2 2.09788075 c
## 7 6 <NA> NA NA <NA> d 0 0.09773865 e
## 8 8 <NA> NA NA <NA> l 4 0.08717624 f
Let’s keep the ID values of df1 but the columns of df2.
semi_join(df1, df2, by="ID")
## ID w x y z
## 1 1 a 1 0.72875246 a
## 2 3 c 0 -0.09441652 c
Bhalla, Deepanshu. 2020. DPLYR Tutorial: Data Manipulation (50 Examples). https://www.listendata.com/2016/08/dplyr-tutorial.html#join-function
Boehmke, B. Apply Family. 2016. http://uc-r.github.io/apply_family
Boehmke, B., & Greenwell, B.M. 2019. Hands-On Machine Learning with R.
Muller, Kirill. 2021. dplyr. https://dplyr.tidyverse.org/