library(tidyverse)
library(forcats)
It’s a sample of data from the General Social Survey, which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago
head(gss_cat)
| year | marital | age | race | rincome | partyid | relig | denom | tvhours |
|---|---|---|---|---|---|---|---|---|
| 2000 | Never married | 26 | White | $8000 to 9999 | Ind,near rep | Protestant | Southern baptist | 12 |
| 2000 | Divorced | 48 | White | $8000 to 9999 | Not str republican | Protestant | Baptist-dk which | NA |
| 2000 | Widowed | 67 | White | Not applicable | Independent | Protestant | No denomination | 2 |
| 2000 | Never married | 39 | White | Not applicable | Ind,near rep | Orthodox-christian | Not applicable | 4 |
| 2000 | Divorced | 25 | White | Not applicable | Not str democrat | None | Not applicable | 1 |
| 2000 | Married | 25 | White | $20000 - 24999 | Strong democrat | Protestant | Southern baptist | NA |
gss_cat %>%
count(race)
| race | n |
|---|---|
| Other | 1959 |
| Black | 3129 |
| White | 16395 |
It’s often useful to change the order of the factor levels in a visualization
relig <- gss_cat %>%
group_by(relig) %>%
summarize(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()
2. You can use fct_relevel(). It takes a factor, f, and then any number of levels that you want to move to the front of the line:
rincome <- gss_cat %>%
group_by(rincome) %>%
summarize(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
) %>%
ggplot(
.,
aes(age, fct_relevel(rincome, "Not applicable"))
) +
geom_point()
rincome
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
group_by(age, marital) %>%
count() %>%
mutate(prop = n / sum(n))
ggplot(by_age, aes(age, n, color = marital)) +
geom_line(na.rm = TRUE)
ggplot(
by_age,
aes(age, n, color = fct_reorder2(marital, age, n))
) +
geom_line() +
labs(color = "marital")
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)) %>%
count(partyid)
| partyid | n |
|---|---|
| No answer | 154 |
| Don’t know | 1 |
| Other party | 393 |
| Republican, strong | 2314 |
| Republican, weak | 3032 |
| Independent, near rep | 1791 |
| Independent | 4119 |
| Independent, near dem | 2499 |
| Democrat, weak | 3690 |
| Democrat, strong | 3490 |
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode().
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) %>%
count(partyid)
| partyid | n |
|---|---|
| other | 548 |
| rep | 5346 |
| ind | 8409 |
| dem | 7180 |
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump():
gss_cat %>%
mutate(relig = fct_lump(relig)) %>%
count(relig)
| relig | n |
|---|---|
| Protestant | 10846 |
| Other | 10637 |
The default behavior is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group we can use the n parameter to specify how many groups (excluding other) we want to keep:
gss_cat %>%
mutate(relig = fct_lump(relig, n = 10)) %>%
count(relig, sort = TRUE) %>%
print(n = Inf)
## # A tibble: 10 x 2
## relig n
## <fctr> <int>
## 1 Protestant 10846
## 2 Catholic 5124
## 3 None 3523
## 4 Christian 689
## 5 Other 458
## 6 Jewish 388
## 7 Buddhism 147
## 8 Inter-nondenominational 109
## 9 Moslem/islam 104
## 10 Orthodox-christian 95
The following toy example splits up the mtcars dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut—a one-sided formula:
models <- mtcars %>%
split(.$cyl) %>%
map(~lm(mpg ~ wt, data = .))
When you’re looking at many models, you might want to extract a summary statistic like the R2
models %>%
map(summary) %>%
map_dbl("r.squared")
## 4 6 8
## 0.5086326 0.4645102 0.4229655
safely() is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:
result The original result. If there was an error, this will be NULL.
error An error object. If the operation was successful, this will be NULL.
safely() is designed to work with map:
x <- list(1, 10, "a")
y <- x %>% map(safely(log))
str(y)
## List of 3
## $ :List of 2
## ..$ result: num 0
## ..$ error : NULL
## $ :List of 2
## ..$ result: num 2.3
## ..$ error : NULL
## $ :List of 2
## ..$ result: NULL
## ..$ error :List of 2
## .. ..$ message: chr "non-numeric argument to mathematical function"
## .. ..$ call : language log(x = x, base = base)
## .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
This would be easier to work with if we had two lists: one of all the errors and one of all the output. That’s easy to get with purrr::transpose():
y <- y %>% transpose()
str(y)
## List of 2
## $ result:List of 3
## ..$ : num 0
## ..$ : num 2.3
## ..$ : NULL
## $ error :List of 3
## ..$ : NULL
## ..$ : NULL
## ..$ :List of 2
## .. ..$ message: chr "non-numeric argument to mathematical function"
## .. ..$ call : language log(x = x, base = base)
## .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
Typically you’ll either look at the values of x where y is an error, or work with the values of y that are OK:
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
## [[1]]
## [1] "a"
y$result[is_ok] %>% flatten_dbl()
## [1] 0.000000 2.302585
purrr provides two other useful adverbs:
x <- list(1, 10, "a")
x %>% map_dbl(possibly(log, NA_real_))
## [1] 0.000000 2.302585 NA
x <- list(1, -1)
x %>% map(quietly(log)) %>% str()
## List of 2
## $ :List of 4
## ..$ result : num 0
## ..$ output : chr ""
## ..$ warnings: chr(0)
## ..$ messages: chr(0)
## $ :List of 4
## ..$ result : num NaN
## ..$ output : chr ""
## ..$ warnings: chr "NaNs produced"
## ..$ messages: chr(0)
purrr provides pmap(), which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:
n <- list(1, 3, 5)
sigma <- list(1, 5, 10)
mu <- list(5, 10, -3)
args1 <- list(n, mu, sigma)
args1 %>%
pmap(rnorm) %>%
str()
## List of 3
## $ : num 6.5
## $ : num [1:3] 12.1 14.8 2.1
## $ : num [1:5] -2.01 -6.04 8.28 -11.52 -18.44
It’s better to name the arguments:
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>%
pmap(rnorm) %>%
str()
## List of 3
## $ : num 4.67
## $ : num [1:3] 10.32 11.79 5.33
## $ : num [1:5] -4.6 -9.06 -13.24 -24.45 1.08
There’s one more step up in complexity—as well as varying the arguments to the function you might also vary the function itself:
f <- c("runif", "rnorm", "rpois")
param <- list(
list(min = -1, max = 1),
list(sd = 5),
list(lambda = 10)
)
To handle this case, you can use invoke_map():
invoke_map(f, param, n = 5) %>% str()
## List of 3
## $ : num [1:5] 0.515 -0.313 0.288 -0.426 -0.397
## $ : num [1:5] -2.966 3.06 0.256 6.525 0.466
## $ : int [1:5] 6 16 12 13 12
The first argument is a list of functions or a character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function
And again, you can use tribble() to make creating these matching pairs a little easier:
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>%
mutate(sim = invoke_map(f, params, n = 10))
| f | params | sim |
|---|---|---|
| runif | -1, 1 | 0.02783315, 0.48112442, 0.95171522, 0.86166463, -0.22065507, -0.77405899, -0.63662948, 0.20527812, 0.96161927, 0.58387992 |
| rnorm | 5 | -2.290049, 1.476887, 2.409451, -3.556008, -8.998964, -4.962423, -3.024819, 2.587446, 4.171519, 2.285664 |
| rpois | 10 | 11, 9, 7, 10, 9, 17, 11, 15, 5, 11 |
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk
library(ggplot2)
plots <- mtcars %>%
split(.$cyl) %>%
map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- stringr::str_c(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
A number of functions work with predicate functions that return either a single TRUE or FALSE. keep() and discard() keep elements of the input where the predicate is TRUE or FALSE, respectively:
iris %>%
keep(is.factor) %>%
str()
## 'data.frame': 150 obs. of 1 variable:
## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris %>%
discard(is.factor) %>%
str()
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
some() and every() determine if the predicate is true for any or for all of the elements:
x <- list(1:5, letters, list(10))
x %>%
some(is_character)
## [1] TRUE
x %>%
every(is_vector)
## [1] TRUE
detect() finds the first element where the predicate is true; detect_index() returns its position:
x <- sample(10)
x %>%
detect(~ . > 5)
## [1] 6
x %>%
detect_index(~ . > 5)
## [1] 1
head_while() and tail_while() take elements from the start or end of a vector while a predicate is true:
x %>%
head_while(~ . > 5)
## [1] 6 10 7
x %>%
tail_while(~ . > 5)
## [1] 8
Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton.
dfs <- list(
age = tibble(name = "John", age = 30),
sex = tibble(name = c("John", "Mary"), sex = c("M", "F")),
trt = tibble(name = "Mary", treatment = "A")
)
dfs %>% reduce(full_join)
## Joining, by = "name"
## Joining, by = "name"
| name | age | sex | treatment |
|---|---|---|---|
| John | 30 | M | NA |
| Mary | NA | F | A |
Or maybe you have a list of vectors, and want to find the intersection:
vs <- list(
c(1, 3, 5, 6, 10),
c(1, 2, 3, 7, 8, 10),
c(1, 2, 3, 4, 8, 9, 10)
)
vs %>% reduce(intersect)
## [1] 1 3 10
The reduce function takes a “binary” function (i.e., a function with two primary inputs), and applies it repeatedly to a list until there is only a single element left.
Accumulate is similar but it keeps all the interim results. You could use it to implement a cumulative sum:
x %>% accumulate(`+`)
## [1] 6 16 23 24 29 38 42 45 47 55
library(tidyverse)
library(modelr)
options(na.action = na.warn)