In this lab activity, we’ll discuss Data Wrangling/Transformation via
the dplyr package. We’ll explore ways to choose subsets of
data, aggregate data to create summaries, make new variables, and sort
your data frames. It is recommended you also explore the RStudio
Cheatsheet on Data
Transformation as we discuss this content.
Back to gapminder
Here is a look at the gapminder data frame in the
gapminder package.
library(gapminder)
rmarkdown::paged_table(gapminder)
# Has been set as the default way to print data frames
# via df_print: paged in the YAML at the top
Say we wanted mean life expectancy across all years for Asia
# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649
library(dplyr)
gapminder %>%
filter(continent == "Asia") %>%
summarize(mean_exp = mean(lifeExp))
gapminder %>%
filter(year == 2002 | continent == "Europe")
- Use
& or , to check for all of
multiple filters being true:
gapminder %>%
filter(year == 2002, continent == "Europe")
- Use
%in% to check for any being true (shortcut to using
| repeatedly with ==)
gapminder %>%
filter(country %in% c("Argentina", "Belgium", "Mexico"),
year %in% c(1987, 1992))
summarize()
- Any numerical summary that you want to apply to a column of a data
frame is specified within
summarize().
max_exp_1997 <- gapminder %>%
filter(year == 1997) %>%
summarize(max_exp = max(lifeExp))
max_exp_1997
Combining summarize() with group_by()
When you’d like to determine a numerical summary for all levels of a
different categorical variable
max_exp_1997_by_cont <- gapminder %>%
filter(year == 1997) %>%
group_by(continent) %>%
summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
Without the %>%
It’s hard to appreciate the %>% without seeing what
the code would look like without it:
max_exp_1997_by_cont <-
summarize(
group_by(
filter(
gapminder,
year == 1997),
continent),
max_exp = max(lifeExp))
max_exp_1997_by_cont
mutate()
- Allows you to
- create a new variable based on other variables OR
- change the contents of an existing variable
- create a new variable based on other variables
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
gap_w_gdp
mutate()
- change the contents of an existing variable
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
gap_weird
arrange()
- Reorders the rows in a data frame based on the values of one or more
variables
gapminder %>%
arrange(year, country)
- Can also put into descending order
gapminder %>%
filter(year > 2000) %>%
arrange(desc(lifeExp))
Other useful dplyr verbs
select
top_n
sample_n
slice
glimpse
rename
Your Task
Determine which African country had the highest GDP per capita in
1982 using the gapminder data in the gapminder
package.
library(gapminder)
library(dplyr)
gapminder %>%
filter(year == 1982, continent == "Africa") %>%
arrange(desc(gdpPercap))
Libya has the highest GDP per capita in Africa in 1982.
Challenge
For both of these problems below, use the bechdel data
frame in the fivethirtyeight package:
- Use the
count function in the dplyr
package to determine how many movies in 2013 fell into each of the
different categories for clean_test
- Determine the percentage of movies that received the value of
"ok" across all years
library(dplyr)
library(fivethirtyeight)
view(bechdel)
bechdel %>%
filter(year == 2013) %>%
count(clean_test)
bechdel %>%
count(clean_test) %>%
mutate(percentage = n / sum(n) * 100) %>%
filter(clean_test == "ok")
Your Task
Determine the top five movies in terms of domestic return on
investment for 2013 scaled data using the bechdel data
frame in the fivethirtyeight package.
bechdel %>%
mutate(domestic_roi_2013 = domgross_2013 / budget_2013) %>%
filter(year == 2013) %>%
arrange(desc(domestic_roi_2013)) %>%
select(title, year, domestic_roi_2013)
The top 5 movies in domestic return on investment in 2013 in order is
The Purge, Insidious: Chapter 2, The Conjuring, Despicable Me 2, and
Mama.
LS0tDQp0aXRsZTogJ0RhdGEgV3JhbmdsaW5nIExhYjogSW50cm8gdG8gYGRwbHlyYCcNCmF1dGhvcjogIkNoZXN0ZXIgSXNtYXkiDQpvdXRwdXQ6DQogIGh0bWxfZG9jdW1lbnQ6DQogICAgY29kZV9kb3dubG9hZDogdHJ1ZQ0KICAgIGNvZGVfZm9sZGluZzogaGlkZQ0KICAgIGRmX3ByaW50OiBwYWdlZA0KICBwZGZfZG9jdW1lbnQ6IGRlZmF1bHQNCiAgd29yZF9kb2N1bWVudDogZGVmYXVsdA0KLS0tDQoNCmBgYHtyIGluY2x1ZGU9RkFMU0V9DQpsaWJyYXJ5KHRpZHl2ZXJzZSkNCmtuaXRyOjpvcHRzX2NodW5rJHNldChtZXNzYWdlPUZBTFNFKQ0KZmlsdGVyIDwtIGRwbHlyOjpmaWx0ZXINCmtuaXRyOjpvcHRzX2NodW5rJHNldCh3YXJuaW5nPUZBTFNFLCBtZXNzYWdlPUZBTFNFLCBmaWcud2lkdGg9OS41LCBmaWcuaGVpZ2h0PTQuNSwgY29tbWVudD1OQSwgcm93cy5wcmludD0xNiwgb3V0LndpZHRoID0gIlxcdGV4dHdpZHRoIikNCnRoZW1lX3NldCh0aGVtZV9ncmF5KGJhc2Vfc2l6ZSA9IDIwKSkNCmBgYA0KDQpJbiB0aGlzIGxhYiBhY3Rpdml0eSwgd2UnbGwgZGlzY3VzcyBEYXRhIFdyYW5nbGluZy9UcmFuc2Zvcm1hdGlvbiB2aWEgdGhlIGBkcGx5cmAgcGFja2FnZS4gV2UnbGwgZXhwbG9yZSB3YXlzIHRvIGNob29zZSBzdWJzZXRzIG9mIGRhdGEsIGFnZ3JlZ2F0ZSBkYXRhIHRvIGNyZWF0ZSBzdW1tYXJpZXMsIG1ha2UgbmV3IHZhcmlhYmxlcywgYW5kIHNvcnQgeW91ciBkYXRhIGZyYW1lcy4gSXQgaXMgcmVjb21tZW5kZWQgeW91IGFsc28gZXhwbG9yZSB0aGUgUlN0dWRpbyBDaGVhdHNoZWV0IG9uIFtEYXRhIFRyYW5zZm9ybWF0aW9uXShodHRwczovL2dpdGh1Yi5jb20vcnN0dWRpby9jaGVhdHNoZWV0cy9yYXcvbWFzdGVyL3NvdXJjZS9wZGZzL2RhdGEtdHJhbnNmb3JtYXRpb24tY2hlYXRzaGVldC5wZGYpIGFzIHdlIGRpc2N1c3MgdGhpcyBjb250ZW50Lg0KDQojIyMgQmFjayB0byBgZ2FwbWluZGVyYA0KDQpIZXJlIGlzIGEgbG9vayBhdCB0aGUgYGdhcG1pbmRlcmAgZGF0YSBmcmFtZSBpbiB0aGUgYGdhcG1pbmRlcmAgcGFja2FnZS4NCg0KYGBge3J9DQpsaWJyYXJ5KGdhcG1pbmRlcikNCnJtYXJrZG93bjo6cGFnZWRfdGFibGUoZ2FwbWluZGVyKQ0KIyBIYXMgYmVlbiBzZXQgYXMgdGhlIGRlZmF1bHQgd2F5IHRvIHByaW50IGRhdGEgZnJhbWVzDQojIHZpYSBkZl9wcmludDogcGFnZWQgaW4gdGhlIFlBTUwgYXQgdGhlIHRvcA0KYGBgDQoNClNheSB3ZSB3YW50ZWQgbWVhbiBsaWZlIGV4cGVjdGFuY3kgYWNyb3NzIGFsbCB5ZWFycyBmb3IgQXNpYQ0KDQpgYGB7cn0NCiMgQmFzZSBSDQphc2lhIDwtIGdhcG1pbmRlcltnYXBtaW5kZXIkY29udGluZW50ID09ICJBc2lhIiwgXQ0KbWVhbihhc2lhJGxpZmVFeHApDQpgYGANCg0KYGBge3J9DQpsaWJyYXJ5KGRwbHlyKQ0KZ2FwbWluZGVyICU+JSANCiAgZmlsdGVyKGNvbnRpbmVudCA9PSAiQXNpYSIpICU+JQ0KICBzdW1tYXJpemUobWVhbl9leHAgPSBtZWFuKGxpZmVFeHApKQ0KYGBgDQoNCg0KYGBge3J9DQpnYXBtaW5kZXIgJT4lIA0KICBmaWx0ZXIoeWVhciA9PSAyMDAyIHwgY29udGluZW50ID09ICJFdXJvcGUiKQ0KYGBgDQoNCi0gICBVc2UgYCZgIG9yIGAsYCB0byBjaGVjayBmb3IgYWxsIG9mIG11bHRpcGxlIGZpbHRlcnMgYmVpbmcgdHJ1ZToNCg0KYGBge3J9DQpnYXBtaW5kZXIgJT4lIA0KICBmaWx0ZXIoeWVhciA9PSAyMDAyLCBjb250aW5lbnQgPT0gIkV1cm9wZSIpDQpgYGANCg0KLSAgIFVzZSBgJWluJWAgdG8gY2hlY2sgZm9yIGFueSBiZWluZyB0cnVlIChzaG9ydGN1dCB0byB1c2luZyBgfGAgcmVwZWF0ZWRseSB3aXRoIGA9PWApDQoNCmBgYHtyfQ0KZ2FwbWluZGVyICU+JSANCiAgZmlsdGVyKGNvdW50cnkgJWluJSBjKCJBcmdlbnRpbmEiLCAiQmVsZ2l1bSIsICJNZXhpY28iKSwNCiAgICAgICAgIHllYXIgJWluJSBjKDE5ODcsIDE5OTIpKQ0KYGBgDQoNCiMjIGBzdW1tYXJpemUoKWANCg0KLSAgIEFueSBudW1lcmljYWwgc3VtbWFyeSB0aGF0IHlvdSB3YW50IHRvIGFwcGx5IHRvIGEgY29sdW1uIG9mIGEgZGF0YSBmcmFtZSBpcyBzcGVjaWZpZWQgd2l0aGluIGBzdW1tYXJpemUoKWAuDQoNCmBgYHtyfQ0KbWF4X2V4cF8xOTk3IDwtIGdhcG1pbmRlciAlPiUgDQogIGZpbHRlcih5ZWFyID09IDE5OTcpICU+JSANCiAgc3VtbWFyaXplKG1heF9leHAgPSBtYXgobGlmZUV4cCkpDQptYXhfZXhwXzE5OTcNCmBgYA0KDQojIyMgQ29tYmluaW5nIGBzdW1tYXJpemUoKWAgd2l0aCBgZ3JvdXBfYnkoKWANCg0KV2hlbiB5b3UnZCBsaWtlIHRvIGRldGVybWluZSBhIG51bWVyaWNhbCBzdW1tYXJ5IGZvciBhbGwgbGV2ZWxzIG9mIGEgZGlmZmVyZW50IGNhdGVnb3JpY2FsIHZhcmlhYmxlDQoNCmBgYHtyfQ0KbWF4X2V4cF8xOTk3X2J5X2NvbnQgPC0gZ2FwbWluZGVyICU+JSANCiAgZmlsdGVyKHllYXIgPT0gMTk5NykgJT4lIA0KICBncm91cF9ieShjb250aW5lbnQpICU+JQ0KICBzdW1tYXJpemUobWF4X2V4cCA9IG1heChsaWZlRXhwKSkNCm1heF9leHBfMTk5N19ieV9jb250DQpgYGANCg0KIyMjIFdpdGhvdXQgdGhlIGAlPiVgDQoNCkl0J3MgaGFyZCB0byBhcHByZWNpYXRlIHRoZSBgJT4lYCB3aXRob3V0IHNlZWluZyB3aGF0IHRoZSBjb2RlIHdvdWxkIGxvb2sgbGlrZSB3aXRob3V0IGl0Og0KDQpgYGB7cn0NCm1heF9leHBfMTk5N19ieV9jb250IDwtIA0KICBzdW1tYXJpemUoDQogICAgZ3JvdXBfYnkoDQogICAgICBmaWx0ZXIoDQogICAgICAgIGdhcG1pbmRlciwgDQogICAgICAgICAgeWVhciA9PSAxOTk3KSwgDQogICAgICBjb250aW5lbnQpLA0KICAgIG1heF9leHAgPSBtYXgobGlmZUV4cCkpDQptYXhfZXhwXzE5OTdfYnlfY29udA0KYGBgDQoNCiMjIGBtdXRhdGUoKWANCg0KLSAgIEFsbG93cyB5b3UgdG8NCiAgICAxLiAgY3JlYXRlIGEgbmV3IHZhcmlhYmxlIGJhc2VkIG9uIG90aGVyIHZhcmlhYmxlcyBPUg0KICAgIDIuICBjaGFuZ2UgdGhlIGNvbnRlbnRzIG9mIGFuIGV4aXN0aW5nIHZhcmlhYmxlDQoNCjEuICBjcmVhdGUgYSBuZXcgdmFyaWFibGUgYmFzZWQgb24gb3RoZXIgdmFyaWFibGVzDQoNCmBgYHtyfQ0KZ2FwX3dfZ2RwIDwtIGdhcG1pbmRlciAlPiUgbXV0YXRlKGdkcCA9IHBvcCAqIGdkcFBlcmNhcCkNCmdhcF93X2dkcA0KYGBgDQoNCiMjIGBtdXRhdGUoKWANCg0KMy4gIGNoYW5nZSB0aGUgY29udGVudHMgb2YgYW4gZXhpc3RpbmcgdmFyaWFibGUNCg0KYGBge3J9DQpnYXBfd2VpcmQgPC0gZ2FwbWluZGVyICU+JSBtdXRhdGUocG9wID0gcG9wICsgMTAwMCkNCmdhcF93ZWlyZA0KYGBgDQoNCiMjIGBhcnJhbmdlKClgDQoNCi0gICBSZW9yZGVycyB0aGUgcm93cyBpbiBhIGRhdGEgZnJhbWUgYmFzZWQgb24gdGhlIHZhbHVlcyBvZiBvbmUgb3IgbW9yZSB2YXJpYWJsZXMNCg0KYGBge3J9DQpnYXBtaW5kZXIgJT4lDQogIGFycmFuZ2UoeWVhciwgY291bnRyeSkNCmBgYA0KDQotICAgQ2FuIGFsc28gcHV0IGludG8gZGVzY2VuZGluZyBvcmRlcg0KDQpgYGB7ciBkZXNjfQ0KZ2FwbWluZGVyICU+JQ0KICBmaWx0ZXIoeWVhciA+IDIwMDApICU+JQ0KICBhcnJhbmdlKGRlc2MobGlmZUV4cCkpDQpgYGANCg0KIyMgT3RoZXIgdXNlZnVsIGBkcGx5cmAgdmVyYnMNCg0KLSAgIGBzZWxlY3RgDQotICAgYHRvcF9uYA0KLSAgIGBzYW1wbGVfbmANCi0gICBgc2xpY2VgDQotICAgYGdsaW1wc2VgDQotICAgYHJlbmFtZWANCg0KIyMgWW91ciBUYXNrDQoNCkRldGVybWluZSB3aGljaCBBZnJpY2FuIGNvdW50cnkgaGFkIHRoZSBoaWdoZXN0IEdEUCBwZXIgY2FwaXRhIGluIDE5ODIgdXNpbmcgdGhlIGBnYXBtaW5kZXJgIGRhdGEgaW4gdGhlIGBnYXBtaW5kZXJgIHBhY2thZ2UuDQoNCmBgYHtyfQ0KbGlicmFyeShnYXBtaW5kZXIpDQpsaWJyYXJ5KGRwbHlyKQ0KDQpnYXBtaW5kZXIgJT4lDQogIGZpbHRlcih5ZWFyID09IDE5ODIsIGNvbnRpbmVudCA9PSAiQWZyaWNhIikgJT4lDQogIGFycmFuZ2UoZGVzYyhnZHBQZXJjYXApKQ0KDQpgYGANCkxpYnlhIGhhcyB0aGUgaGlnaGVzdCBHRFAgcGVyIGNhcGl0YSBpbiBBZnJpY2EgaW4gMTk4Mi4NCg0KKipDaGFsbGVuZ2UqKg0KDQpGb3IgYm90aCBvZiB0aGVzZSBwcm9ibGVtcyBiZWxvdywgdXNlIHRoZSBgYmVjaGRlbGAgZGF0YSBmcmFtZSBpbiB0aGUgYGZpdmV0aGlydHllaWdodGAgcGFja2FnZToNCg0KLSAgIFVzZSB0aGUgYGNvdW50YCBmdW5jdGlvbiBpbiB0aGUgYGRwbHlyYCBwYWNrYWdlIHRvIGRldGVybWluZSBob3cgbWFueSBtb3ZpZXMgaW4gMjAxMyBmZWxsIGludG8gZWFjaCBvZiB0aGUgZGlmZmVyZW50IGNhdGVnb3JpZXMgZm9yIGBjbGVhbl90ZXN0YA0KLSAgIERldGVybWluZSB0aGUgcGVyY2VudGFnZSBvZiBtb3ZpZXMgdGhhdCByZWNlaXZlZCB0aGUgdmFsdWUgb2YgYCJvayJgIGFjcm9zcyBhbGwgeWVhcnMNCg0KDQpgYGB7cn0NCmxpYnJhcnkoZHBseXIpDQpsaWJyYXJ5KGZpdmV0aGlydHllaWdodCkNCnZpZXcoYmVjaGRlbCkNCmJlY2hkZWwgJT4lDQogIGZpbHRlcih5ZWFyID09IDIwMTMpICU+JQ0KICBjb3VudChjbGVhbl90ZXN0KQ0KDQpgYGANCg0KYGBge3J9DQpiZWNoZGVsICU+JQ0KICBjb3VudChjbGVhbl90ZXN0KSAlPiUNCiAgbXV0YXRlKHBlcmNlbnRhZ2UgPSBuIC8gc3VtKG4pICogMTAwKSAlPiUNCiAgZmlsdGVyKGNsZWFuX3Rlc3QgPT0gIm9rIikNCg0KYGBgDQoNCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KDQojIyBZb3VyIFRhc2sNCg0KRGV0ZXJtaW5lIHRoZSB0b3AgZml2ZSBtb3ZpZXMgaW4gdGVybXMgb2YgZG9tZXN0aWMgcmV0dXJuIG9uIGludmVzdG1lbnQgZm9yIDIwMTMgc2NhbGVkIGRhdGEgdXNpbmcgdGhlIGBiZWNoZGVsYCBkYXRhIGZyYW1lIGluIHRoZSBgZml2ZXRoaXJ0eWVpZ2h0YCBwYWNrYWdlLg0KDQpgYGB7cn0NCmJlY2hkZWwgJT4lDQogIG11dGF0ZShkb21lc3RpY19yb2lfMjAxMyA9IGRvbWdyb3NzXzIwMTMgLyBidWRnZXRfMjAxMykgJT4lIA0KICBmaWx0ZXIoeWVhciA9PSAyMDEzKSAlPiUNCiAgYXJyYW5nZShkZXNjKGRvbWVzdGljX3JvaV8yMDEzKSkgJT4lDQogIHNlbGVjdCh0aXRsZSwgeWVhciwgZG9tZXN0aWNfcm9pXzIwMTMpDQpgYGANClRoZSB0b3AgNSBtb3ZpZXMgaW4gZG9tZXN0aWMgcmV0dXJuIG9uIGludmVzdG1lbnQgaW4gMjAxMyBpbiBvcmRlciBpcyBUaGUgUHVyZ2UsIEluc2lkaW91czogQ2hhcHRlciAyLCBUaGUgQ29uanVyaW5nLCBEZXNwaWNhYmxlIE1lIDIsIGFuZCBNYW1hLg==