What is dplyr?
- The
dplyr package was developed by Hadley Wickham of RStudio.
dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R.
dplyr is the next iteration of plyr , focussing on only data frames.
- With
dplyr , anything you can do to a local data frame you can also do to a remote database table.
dplyr Functionality
- Five basic verbs:
filter, select, arrange, mutate, summarise and groub_by
- Can work with data stored in databases and data tables
- Joins: inner join, left join, semi-join, anti-join
- Window functions for calculating ranking, offsets, and more
- Better than plyr if you’re only working with data frames (though it doesn’t yet duplicate all of the plyr functionality)
Why dplyr?
- Great for data exploration and transformation
- Intuitive to write and easy to read, especially when using the “chaining” syntax (covered below)
- Fast on data frames
dplyr Grammar
select: return a subset of the column of a data franme, using a flexible notation.
filter: extract a subset of rows from a data frame based on logical conditions.
arrange: reorder rows of data frame
mutate: add new variables/columns or transform existing variables.
summarise/summarize: generate summary statistics of different variables in the data frame, possibly within strata.
%>% “pipe” operator used to connect multiple verb actions together into a pipeline.
Installing dplyr
install.packages('dplyr')
Loading Data: The Gapminder Dataset
# Install gapminder dataset
install.packages('gapminder')
# Load gapminder dataset
library(gapminder)
Exploring the Gapminder Dataset
# Examine first few rows
head(gapminder)
# A tibble: 6 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
# Examine last few rows
tail(gapminder)
# A tibble: 6 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Zimbabwe Africa 1982 60.4 7636524 789.
2 Zimbabwe Africa 1987 62.4 9216418 706.
3 Zimbabwe Africa 1992 60.4 10704340 693.
4 Zimbabwe Africa 1997 46.8 11404948 792.
5 Zimbabwe Africa 2002 40.0 11926563 672.
6 Zimbabwe Africa 2007 43.5 12311143 470.
# Dimensions
dim(gapminder)
[1] 1704 6
# Colnames
names(gapminder)
[1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
# Data Structures
str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
# Summary
summary(gapminder)
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
Command Structure (for all dplyr verbs)
- first argument is a data frame
- return value is a data frame
- nothing is modified in place
- Note: dplyr generally does not preserve row names
Load dplyr Package
# Load dplyr
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
select: Pick Single Column by Name
select(gapminder, country)
# A tibble: 1,704 x 1
country
<fct>
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
5 Afghanistan
6 Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
# … with 1,694 more rows
select: Pick Multiple Columns by Name
select(gapminder, country, continent, year)
# A tibble: 1,704 x 3
country continent year
<fct> <fct> <int>
1 Afghanistan Asia 1952
2 Afghanistan Asia 1957
3 Afghanistan Asia 1962
4 Afghanistan Asia 1967
5 Afghanistan Asia 1972
6 Afghanistan Asia 1977
7 Afghanistan Asia 1982
8 Afghanistan Asia 1987
9 Afghanistan Asia 1992
10 Afghanistan Asia 1997
# … with 1,694 more rows
select: Removing Single Column
select(gapminder, - gdpPercap)
# A tibble: 1,704 x 5
country continent year lifeExp pop
<fct> <fct> <int> <dbl> <int>
1 Afghanistan Asia 1952 28.8 8425333
2 Afghanistan Asia 1957 30.3 9240934
3 Afghanistan Asia 1962 32.0 10267083
4 Afghanistan Asia 1967 34.0 11537966
5 Afghanistan Asia 1972 36.1 13079460
6 Afghanistan Asia 1977 38.4 14880372
7 Afghanistan Asia 1982 39.9 12881816
8 Afghanistan Asia 1987 40.8 13867957
9 Afghanistan Asia 1992 41.7 16317921
10 Afghanistan Asia 1997 41.8 22227415
# … with 1,694 more rows
select: Removing Multiple Columns
select(gapminder, -c(pop, gdpPercap))
# A tibble: 1,704 x 4
country continent year lifeExp
<fct> <fct> <int> <dbl>
1 Afghanistan Asia 1952 28.8
2 Afghanistan Asia 1957 30.3
3 Afghanistan Asia 1962 32.0
4 Afghanistan Asia 1967 34.0
5 Afghanistan Asia 1972 36.1
6 Afghanistan Asia 1977 38.4
7 Afghanistan Asia 1982 39.9
8 Afghanistan Asia 1987 40.8
9 Afghanistan Asia 1992 41.7
10 Afghanistan Asia 1997 41.8
# … with 1,694 more rows
select: Select Column Using : (Range)
select(gapminder, country:year)
# A tibble: 1,704 x 3
country continent year
<fct> <fct> <int>
1 Afghanistan Asia 1952
2 Afghanistan Asia 1957
3 Afghanistan Asia 1962
4 Afghanistan Asia 1967
5 Afghanistan Asia 1972
6 Afghanistan Asia 1977
7 Afghanistan Asia 1982
8 Afghanistan Asia 1987
9 Afghanistan Asia 1992
10 Afghanistan Asia 1997
# … with 1,694 more rows
select: Select Single Column Using “contains”
select(gapminder, contains("c"))
# A tibble: 1,704 x 3
country continent gdpPercap
<fct> <fct> <dbl>
1 Afghanistan Asia 779.
2 Afghanistan Asia 821.
3 Afghanistan Asia 853.
4 Afghanistan Asia 836.
5 Afghanistan Asia 740.
6 Afghanistan Asia 786.
7 Afghanistan Asia 978.
8 Afghanistan Asia 852.
9 Afghanistan Asia 649.
10 Afghanistan Asia 635.
# … with 1,694 more rows
select: Select Multiple Columns Using “contains”
select(gapminder, contains("c"), contains('g'))
# A tibble: 1,704 x 3
country continent gdpPercap
<fct> <fct> <dbl>
1 Afghanistan Asia 779.
2 Afghanistan Asia 821.
3 Afghanistan Asia 853.
4 Afghanistan Asia 836.
5 Afghanistan Asia 740.
6 Afghanistan Asia 786.
7 Afghanistan Asia 978.
8 Afghanistan Asia 852.
9 Afghanistan Asia 649.
10 Afghanistan Asia 635.
# … with 1,694 more rows
select: Select Column Using “starts_with”
select(gapminder, starts_with('c'))
# A tibble: 1,704 x 2
country continent
<fct> <fct>
1 Afghanistan Asia
2 Afghanistan Asia
3 Afghanistan Asia
4 Afghanistan Asia
5 Afghanistan Asia
6 Afghanistan Asia
7 Afghanistan Asia
8 Afghanistan Asia
9 Afghanistan Asia
10 Afghanistan Asia
# … with 1,694 more rows
select: Select Column Using “ends_with”
select(gapminder, ends_with('p'))
# A tibble: 1,704 x 3
lifeExp pop gdpPercap
<dbl> <int> <dbl>
1 28.8 8425333 779.
2 30.3 9240934 821.
3 32.0 10267083 853.
4 34.0 11537966 836.
5 36.1 13079460 740.
6 38.4 14880372 786.
7 39.9 12881816 978.
8 40.8 13867957 852.
9 41.7 16317921 649.
10 41.8 22227415 635.
# … with 1,694 more rows
Chaining Method: The Pipe(%>%) Operator
gapminder %>%
select(country, continent, year) %>%
head()
# A tibble: 6 x 3
country continent year
<fct> <fct> <int>
1 Afghanistan Asia 1952
2 Afghanistan Asia 1957
3 Afghanistan Asia 1962
4 Afghanistan Asia 1967
5 Afghanistan Asia 1972
6 Afghanistan Asia 1977
The Count Verb
gapminder %>%
count()
# A tibble: 1 x 1
n
<int>
1 1704
Count Variable
gapminder %>%
count(country)
# A tibble: 142 x 2
country n
* <fct> <int>
1 Afghanistan 12
2 Albania 12
3 Algeria 12
4 Angola 12
5 Argentina 12
6 Australia 12
7 Austria 12
8 Bahrain 12
9 Bangladesh 12
10 Belgium 12
# … with 132 more rows
Count and Sort
gapminder %>%
count(country, sort = TRUE)
# A tibble: 142 x 2
country n
<fct> <int>
1 Afghanistan 12
2 Albania 12
3 Algeria 12
4 Angola 12
5 Argentina 12
6 Australia 12
7 Austria 12
8 Bahrain 12
9 Bangladesh 12
10 Belgium 12
# … with 132 more rows
Count Population
gapminder %>%
select(country, pop) %>%
count(country, wt = pop, sort = TRUE)
# A tibble: 142 x 2
country n
<fct> <dbl>
1 China 11497920623
2 India 8413568878
3 United States 2738534790
4 Indonesia 1779874000
5 Brazil 1467745520
6 Japan 1341105696
7 Pakistan 1124200629
8 Bangladesh 1089064744
9 Germany 930564520
10 Nigeria 884496214
# … with 132 more rows
filter: Equality("==")
filter(gapminder, country == "Bangladesh")
# A tibble: 12 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Bangladesh Asia 1952 37.5 46886859 684.
2 Bangladesh Asia 1957 39.3 51365468 662.
3 Bangladesh Asia 1962 41.2 56839289 686.
4 Bangladesh Asia 1967 43.5 62821884 721.
5 Bangladesh Asia 1972 45.3 70759295 630.
6 Bangladesh Asia 1977 46.9 80428306 660.
7 Bangladesh Asia 1982 50.0 93074406 677.
8 Bangladesh Asia 1987 52.8 103764241 752.
9 Bangladesh Asia 1992 56.0 113704579 838.
10 Bangladesh Asia 1997 59.4 123315288 973.
11 Bangladesh Asia 2002 62.0 135656790 1136.
12 Bangladesh Asia 2007 64.1 150448339 1391.
filter: Inequality("!=")
filter(gapminder, country != "Bangladesh")
# A tibble: 1,692 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,682 more rows
filter: Greater(">")
filter(gapminder, gdpPercap > 800)
# A tibble: 1,460 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1957 30.3 9240934 821.
2 Afghanistan Asia 1962 32.0 10267083 853.
3 Afghanistan Asia 1967 34.0 11537966 836.
4 Afghanistan Asia 1982 39.9 12881816 978.
5 Afghanistan Asia 1987 40.8 13867957 852.
6 Afghanistan Asia 2007 43.8 31889923 975.
7 Albania Europe 1952 55.2 1282697 1601.
8 Albania Europe 1957 59.3 1476505 1942.
9 Albania Europe 1962 64.8 1728137 2313.
10 Albania Europe 1967 66.2 1984060 2760.
# … with 1,450 more rows
filter: Greater or Equal(">=")
filter(gapminder, gdpPercap >= 800)
# A tibble: 1,460 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1957 30.3 9240934 821.
2 Afghanistan Asia 1962 32.0 10267083 853.
3 Afghanistan Asia 1967 34.0 11537966 836.
4 Afghanistan Asia 1982 39.9 12881816 978.
5 Afghanistan Asia 1987 40.8 13867957 852.
6 Afghanistan Asia 2007 43.8 31889923 975.
7 Albania Europe 1952 55.2 1282697 1601.
8 Albania Europe 1957 59.3 1476505 1942.
9 Albania Europe 1962 64.8 1728137 2313.
10 Albania Europe 1967 66.2 1984060 2760.
# … with 1,450 more rows
filter: Less("<")
filter(gapminder, gdpPercap < 800)
# A tibble: 244 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1972 36.1 13079460 740.
3 Afghanistan Asia 1977 38.4 14880372 786.
4 Afghanistan Asia 1992 41.7 16317921 649.
5 Afghanistan Asia 1997 41.8 22227415 635.
6 Afghanistan Asia 2002 42.1 25268405 727.
7 Bangladesh Asia 1952 37.5 46886859 684.
8 Bangladesh Asia 1957 39.3 51365468 662.
9 Bangladesh Asia 1962 41.2 56839289 686.
10 Bangladesh Asia 1967 43.5 62821884 721.
# … with 234 more rows
filter: Less or Equal("<=")
filter(gapminder, gdpPercap <= 800)
# A tibble: 244 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1972 36.1 13079460 740.
3 Afghanistan Asia 1977 38.4 14880372 786.
4 Afghanistan Asia 1992 41.7 16317921 649.
5 Afghanistan Asia 1997 41.8 22227415 635.
6 Afghanistan Asia 2002 42.1 25268405 727.
7 Bangladesh Asia 1952 37.5 46886859 684.
8 Bangladesh Asia 1957 39.3 51365468 662.
9 Bangladesh Asia 1962 41.2 56839289 686.
10 Bangladesh Asia 1967 43.5 62821884 721.
# … with 234 more rows
filter: Logical AND("&")
filter(gapminder, country=="Bangladesh" & gdpPercap > 800)
# A tibble: 4 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Bangladesh Asia 1992 56.0 113704579 838.
2 Bangladesh Asia 1997 59.4 123315288 973.
3 Bangladesh Asia 2002 62.0 135656790 1136.
4 Bangladesh Asia 2007 64.1 150448339 1391.
filter: Logical OR("|")
filter(gapminder, country == "Bangladesh" | gdpPercap > 800)
# A tibble: 1,468 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1957 30.3 9240934 821.
2 Afghanistan Asia 1962 32.0 10267083 853.
3 Afghanistan Asia 1967 34.0 11537966 836.
4 Afghanistan Asia 1982 39.9 12881816 978.
5 Afghanistan Asia 1987 40.8 13867957 852.
6 Afghanistan Asia 2007 43.8 31889923 975.
7 Albania Europe 1952 55.2 1282697 1601.
8 Albania Europe 1957 59.3 1476505 1942.
9 Albania Europe 1962 64.8 1728137 2313.
10 Albania Europe 1967 66.2 1984060 2760.
# … with 1,458 more rows
filter: The "%in%" Operator
filter(gapminder, country %in% c("Bangladesh", "Australia"))
# A tibble: 24 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
# … with 14 more rows
mutate: Creating New Column
gapminder %>%
mutate(gdp = gdpPercap * pop) %>%
head()
# A tibble: 6 x 7
country continent year lifeExp pop gdpPercap gdp
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
# GDP in Million
gapminder %>%
mutate(gdp = gdpPercap * pop / 10^6) %>%
head()
# A tibble: 6 x 7
country continent year lifeExp pop gdpPercap gdp
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6567.
2 Afghanistan Asia 1957 30.3 9240934 821. 7585.
3 Afghanistan Asia 1962 32.0 10267083 853. 8759.
4 Afghanistan Asia 1967 34.0 11537966 836. 9648.
5 Afghanistan Asia 1972 36.1 13079460 740. 9679.
6 Afghanistan Asia 1977 38.4 14880372 786. 11698.
arrange : Reorder Rows
gapminder %>%
select(country, pop) %>%
arrange(pop) %>%
head()
# A tibble: 6 x 2
country pop
<fct> <int>
1 Sao Tome and Principe 60011
2 Sao Tome and Principe 61325
3 Djibouti 63149
4 Sao Tome and Principe 65345
5 Sao Tome and Principe 70787
6 Djibouti 71851
arrange : Reorder Rows(Descending)
gapminder %>%
select(country, pop) %>%
# descending order
arrange(desc(pop)) %>%
head()
# A tibble: 6 x 2
country pop
<fct> <int>
1 China 1318683096
2 China 1280400000
3 China 1230075000
4 China 1164970000
5 India 1110396331
6 China 1084035000
group_by: Grouping Data
gapminder %>%
group_by(continent) %>%
head()
# A tibble: 6 x 6
# Groups: continent [1]
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
group_by and summarize
gapminder %>%
summarize(total_pop = sum(pop))
# A tibble: 1 x 1
total_pop
<dbl>
1 50440465801
Aggregate and summarize
gapminder %>%
summarize(total_population = sum(pop),
avg_gdppercap = mean(gdpPercap))
# A tibble: 1 x 2
total_population avg_gdppercap
<dbl> <dbl>
1 50440465801 7215.
summarise: Sum
gapminder %>%
group_by(continent) %>%
summarise(pop = sum(pop))
# A tibble: 5 x 2
continent pop
* <fct> <dbl>
1 Africa 6187585961
2 Americas 7351438499
3 Asia 30507333901
4 Europe 6181115304
5 Oceania 212992136
summarise: Sum
gapminder %>%
group_by(continent) %>%
# In Million
summarise(pop = sum(pop) / 10^6)
# A tibble: 5 x 2
continent pop
* <fct> <dbl>
1 Africa 6188.
2 Americas 7351.
3 Asia 30507.
4 Europe 6181.
5 Oceania 213.
summarise: Maximum
gapminder %>%
group_by(continent) %>%
summarise(max_liexp = max(lifeExp))
# A tibble: 5 x 2
continent max_liexp
* <fct> <dbl>
1 Africa 76.4
2 Americas 80.7
3 Asia 82.6
4 Europe 81.8
5 Oceania 81.2
Summary Functions
mean()
sum()
median()
min()
max()
n()
Aggregate within Groups
gapminder %>%
group_by(continent) %>%
summarize(total_pop = sum(pop),
avg_lifeexp = mean(lifeExp))
# A tibble: 5 x 3
continent total_pop avg_lifeexp
* <fct> <dbl> <dbl>
1 Africa 6187585961 48.9
2 Americas 7351438499 64.7
3 Asia 30507333901 60.1
4 Europe 6181115304 71.9
5 Oceania 212992136 74.3
The top_n Verb
gapminder %>%
group_by(continent) %>%
top_n(1, pop)
# A tibble: 5 x 6
# Groups: continent [5]
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Australia Oceania 2007 81.2 20434176 34435.
2 China Asia 2007 73.0 1318683096 4959.
3 Germany Europe 2007 79.4 82400996 32170.
4 Nigeria Africa 2007 46.9 135031164 2014.
5 United States Americas 2007 78.2 301139947 42952.
rename: Renaming Column
gapminder %>%
rename(population = pop) %>%
head()
# A tibble: 6 x 6
country continent year lifeExp population gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.