library(tidyverse)
Naming the mapping argument isn’t strictly necessary. If you look at the documentation for ggplot(), the first argument is data and the second argument is mapping. As long as you supply the values in that order, you can leave out the argument names. For instance, these two pieces of code do the same thing:
# With argument names
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Without argument names
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
If you use names, you can put the arguments in any order you want, like this:
# With strange order
ggplot(mapping = aes(x = displ, y = hwy), data = mpg) +
  geom_point()
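To see why the order only matters for unnamed arguments, here is a minimal sketch of how R matches arguments. The show_args() function is a hypothetical stand-in invented for this illustration (it is not part of ggplot2); it only borrows the names of ggplot()’s first two arguments.

```r
# Hypothetical helper: same first two argument names as ggplot(),
# but it only reports which value ended up in which argument
show_args <- function(data, mapping) {
  paste("data =", data, "| mapping =", mapping)
}

# Unnamed arguments are matched by position
by_position <- show_args("mpg", "aes(...)")

# Named arguments are matched by name, so the order doesn't matter
by_name <- show_args(mapping = "aes(...)", data = "mpg")

# Unnamed arguments in the wrong order end up in the wrong slots
swapped <- show_args("aes(...)", "mpg")

identical(by_position, by_name)  # TRUE
identical(by_position, swapped)  # FALSE
```

The same rule is why leaving out the names in ggplot() only works when the dataset comes first and the aes() call comes second.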
With the diamonds %>% count(color) %>% ggplot(...) example, the reason y = count(diamonds, color) doesn’t work is that the aesthetics you set inside a ggplot plot generally need to be columns that already exist in a dataset. The count() function creates a column called n (technically count(color) is a shortcut for group_by(color) %>% summarize(n = n())), and you can then map that n column to the y aesthetic.
It’s more obvious what’s going on if you store the summarized data frame as a separate object first and then plot it, rather than skipping right to the plotting:
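diamond_color_counts <- diamonds %>%
  count(color)

# Look at diamond_color_counts in RStudio to see what it looks like

# geom_col() is one geom that works here, since the counts are already in the
# n column; swap in whichever geom you need
ggplot(diamond_color_counts, aes(x = color, y = n)) +
  geom_col()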
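The claim that count(color) is a shortcut for group_by(color) %>% summarize(n = n()) can be checked directly. This is a small sketch, assuming dplyr and ggplot2 are installed; the object names via_count and via_summarize are made up for the comparison.

```r
library(dplyr)
library(ggplot2)  # only needed here for the diamonds dataset

# The shortcut
via_count <- diamonds %>%
  count(color)

# The longer version that count() stands in for
via_summarize <- diamonds %>%
  group_by(color) %>%
  summarize(n = n())

# Compare as plain data frames so only the column names and values matter
same_result <- isTRUE(all.equal(
  as.data.frame(via_count),
  as.data.frame(via_summarize)
))

same_result       # TRUE
names(via_count)  # "color" "n"
```

Either way, the result is a regular data frame with an n column that can be mapped to an aesthetic.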
Remember that group_by() splits the dataset up into invisible behind-the-scenes groups, each containing all the matching rows for that group. Grouping a second time replaces the existing groups with the new ones.
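One way to make those invisible groups visible is dplyr’s group_vars(), which lists the current grouping columns. The sketch below uses the mpg dataset; the object names regrouped and added are made up, and the .add = TRUE variant is included only for contrast.

```r
library(dplyr)
library(ggplot2)  # only needed here for the mpg dataset

# A second group_by() replaces the first grouping
regrouped <- mpg %>%
  group_by(year) %>%
  group_by(drv)

group_vars(regrouped)  # "drv"

# With .add = TRUE, the new grouping is added to the existing one instead
added <- mpg %>%
  group_by(year) %>%
  group_by(drv, .add = TRUE)

group_vars(added)  # "year" "drv"
```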
Here’s a more practical example that you can run:
# Get a count of the types of drives across the two years. This should show 6 rows
mpg %>%
  group_by(year, drv) %>%
  summarize(num_cars = n())
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups:   year [2]
##    year drv   num_cars
##   <int> <chr>    <int>
## 1  1999 4           49
## 2  1999 f           57
## 3  1999 r           11
## 4  2008 4           54
## 5  2008 f           49
## 6  2008 r           14
# Create a column that shows the proportion of cars. summarize() peels off the
# last grouping variable (drv) and leaves the data grouped by year (the first
# one). Without another group_by(), the mutate() runs within each year, so the
# prop_cars column adds to 100% within each year
mpg %>%
  group_by(year, drv) %>%
  summarize(num_cars = n()) %>%
  mutate(prop_cars = num_cars / sum(num_cars))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   year [2]
##    year drv   num_cars prop_cars
##   <int> <chr>    <int>     <dbl>
## 1  1999 4           49    0.419
## 2  1999 f           57    0.487
## 3  1999 r           11    0.0940
## 4  2008 4           54    0.462
## 5  2008 f           49    0.419
## 6  2008 r           14    0.120
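To confirm that the proportions above really are calculated within each year, here is a small check. It is a sketch rather than part of the original example: the object names are made up, and .groups = "drop_last" is spelled out only to make summarize()’s default behavior explicit and silence the message.

```r
library(dplyr)
library(ggplot2)  # only needed here for the mpg dataset

year_drv_counts <- mpg %>%
  group_by(year, drv) %>%
  # "drop_last" is what summarize() does by default with two grouping columns
  summarize(num_cars = n(), .groups = "drop_last")

# Only year is left as a grouping column
remaining_groups <- group_vars(year_drv_counts)
remaining_groups  # "year"

# Add up the proportions within each year; each total should be 1
prop_totals <- year_drv_counts %>%
  mutate(prop_cars = num_cars / sum(num_cars)) %>%
  summarize(total_prop = sum(prop_cars))

prop_totals
```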
# If we explicitly tell it to group by drv instead, prop_cars will add to
# 100% within each type of drive
mpg %>%
  group_by(year, drv) %>%
  summarize(num_cars = n()) %>%
  group_by(drv) %>%
  mutate(prop_cars = num_cars / sum(num_cars))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   drv [3]
##    year drv   num_cars prop_cars
##   <int> <chr>    <int>     <dbl>
## 1  1999 4           49     0.476
## 2  1999 f           57     0.538
## 3  1999 r           11     0.44
## 4  2008 4           54     0.524
## 5  2008 f           49     0.462
## 6  2008 r           14     0.56
# If we use ungroup(), there won't be any groups at all, so the prop_cars column
# will add to 100% overall
mpg %>%
  group_by(year, drv) %>%
  summarize(num_cars = n()) %>%
  ungroup() %>%
  mutate(prop_cars = num_cars / sum(num_cars))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
##    year drv   num_cars prop_cars
##   <int> <chr>    <int>     <dbl>
## 1  1999 4           49    0.209
## 2  1999 f           57    0.244
## 3  1999 r           11    0.0470
## 4  2008 4           54    0.231
## 5  2008 f           49    0.209
## 6  2008 r           14    0.0598
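The message in the output mentions a .groups argument. As a sketch of one alternative to a separate ungroup() step, setting .groups = "drop" inside summarize() removes all the groups at once and also silences the message. The object name overall_props is made up for this illustration.

```r
library(dplyr)
library(ggplot2)  # only needed here for the mpg dataset

overall_props <- mpg %>%
  group_by(year, drv) %>%
  # Drop every grouping level right away instead of calling ungroup() later
  summarize(num_cars = n(), .groups = "drop") %>%
  mutate(prop_cars = num_cars / sum(num_cars))

# No grouping columns are left
group_vars(overall_props)  # character(0)

# The proportions add up to 1 across the whole table
sum(overall_props$prop_cars)
```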