meghna-answers.knit

library(tidyverse)

Writing out the mapping argument name isn’t strictly necessary. If you look at the documentation for ggplot(), the first argument is data and the second argument is mapping. As long as you supply the arguments in that order, you can leave the argument names out.

For instance, these two pieces of code do the same thing:

# With argument names
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Without argument names
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

If you use names, you can put the arguments in any order you want, like this:

# With strange order
ggplot(mapping = aes(x = displ, y = hwy), data = mpg) +
  geom_point()
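The same matching rules apply to any R function, not just ggplot(). Here is a minimal sketch using a made-up function, describe_args(), which is purely hypothetical and only borrows the data and mapping argument names for illustration:

```r
# Hypothetical stand-in for a function whose first two arguments are
# `data` and `mapping`, in that order (not part of ggplot2)
describe_args <- function(data = NULL, mapping = NULL) {
  paste0("data = ", data, "; mapping = ", mapping)
}

# Named, positional, and named-but-reordered calls all match the same way
by_name     <- describe_args(data = "mpg", mapping = "aes(...)")
by_position <- describe_args("mpg", "aes(...)")
reordered   <- describe_args(mapping = "aes(...)", data = "mpg")

identical(by_name, by_position)
identical(by_name, reordered)

# Without names, swapping the order changes which argument gets which value
swapped <- describe_args("aes(...)", "mpg")
identical(by_name, swapped)
```

The first two comparisons should come back TRUE and the last one FALSE, which is why leaving the names out only works when you keep the documented order.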

With the diamonds %>% count(color) %>% ggplot(...) example, the reason y = count(diamonds, color) doesn’t work is that the aesthetics you set inside aes() generally need to be columns that already exist in the dataset you’re plotting, and count(diamonds, color) returns a whole data frame, not a column. The count() function creates a column called n (technically count(color) is a shortcut for group_by(color) %>% summarize(n = n())), and you can then map that n column to the y aesthetic.

It’s more obvious what’s going on if you store the summarized data frame as a separate object first and then plot it, rather than skipping right to the plotting:

diamond_color_counts <- diamonds %>%
  count(color)

# Look at diamond_color_counts in RStudio to see what it looks like

# geom_col() is one option here; any geom that uses x and y would work
ggplot(diamond_color_counts, aes(x = color, y = n)) +
  geom_col()
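If you want to check the claim that count() is a shortcut for group_by() plus summarize(), you can build the summary both ways and compare. This is only a sketch: the object names short_way and long_way are made up for illustration, and it assumes dplyr and ggplot2 are installed (ggplot2 is only needed for the diamonds data).

```r
library(dplyr)

diamonds <- ggplot2::diamonds

# The shortcut
short_way <- diamonds %>%
  count(color)

# The long version that count() stands in for
long_way <- diamonds %>%
  group_by(color) %>%
  summarize(n = n())

# Both should have a color column and an n column with the same values
names(short_way)
all.equal(as.data.frame(short_way), as.data.frame(long_way))
```

Either version gives you a real n column that can go inside aes().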

Remember that group_by() splits the dataset up into invisible behind-the-scenes groups, each containing all the matching rows for that group. Calling group_by() a second time replaces the existing groups with the new ones rather than adding to them.
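One way to make those invisible groups visible is dplyr’s group_vars(), which lists the columns a data frame is currently grouped by. The sketch below uses made-up object names (grouped, regrouped, summarized) and assumes dplyr and ggplot2 are installed; ggplot2 is only used for the mpg data.

```r
library(dplyr)

mpg <- ggplot2::mpg

# Group by two columns
grouped <- mpg %>%
  group_by(year, drv)
group_vars(grouped)

# Grouping again replaces the old groups instead of adding to them
regrouped <- grouped %>%
  group_by(drv)
group_vars(regrouped)

# summarize() drops the last grouping column (drv) and keeps the rest;
# .groups = "drop_last" spells out that default behavior
summarized <- grouped %>%
  summarize(num_cars = n(), .groups = "drop_last")
group_vars(summarized)
```

The three group_vars() calls should report year and drv, then just drv, then just year.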

Here’s a more practical example that you can run:

# Get a count of the types of drives across the two years. This should show 6 rows
mpg %>% 
  group_by(year, drv) %>% 
  summarize(num_cars = n())

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

## # A tibble: 6 × 3
## # Groups:   year [2]
##    year drv   num_cars
##   <int> <chr>    <int>
## 1  1999 4           49
## 2  1999 f           57
## 3  1999 r           11
## 4  2008 4           54
## 5  2008 f           49
## 6  2008 r           14

# Create a column that shows the proportion of cars. By default, summarize()
# peels off the last group (drv) and leaves the data grouped by year (the first
# group), so without another group_by(), mutate() works within each year,
# meaning that the prop_cars column adds to 100% within each year
mpg %>% 
  group_by(year, drv) %>% 
  summarize(num_cars = n()) %>% 
  mutate(prop_cars = num_cars / sum(num_cars))

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

## # A tibble: 6 × 4
## # Groups:   year [2]
##    year drv   num_cars prop_cars
##   <int> <chr>    <int>     <dbl>
## 1  1999 4           49    0.419 
## 2  1999 f           57    0.487 
## 3  1999 r           11    0.0940
## 4  2008 4           54    0.462 
## 5  2008 f           49    0.419 
## 6  2008 r           14    0.120

# If we explicitly tell it to group by drv again, it'll make it so that
# prop_cars adds to 100% within each drive
mpg %>% 
  group_by(year, drv) %>% 
  summarize(num_cars = n()) %>% 
  group_by(drv) %>% 
  mutate(prop_cars = num_cars / sum(num_cars))

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

## # A tibble: 6 × 4
## # Groups:   drv [3]
##    year drv   num_cars prop_cars
##   <int> <chr>    <int>     <dbl>
## 1  1999 4           49     0.476
## 2  1999 f           57     0.538
## 3  1999 r           11     0.44 
## 4  2008 4           54     0.524
## 5  2008 f           49     0.462
## 6  2008 r           14     0.56

# If we use ungroup(), there won't be any groups at all, so the prop_cars column
# will add to 100% overall
mpg %>% 
  group_by(year, drv) %>% 
  summarize(num_cars = n()) %>% 
  ungroup() %>% 
  mutate(prop_cars = num_cars / sum(num_cars))

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

## # A tibble: 6 × 4
##    year drv   num_cars prop_cars
##   <int> <chr>    <int>     <dbl>
## 1  1999 4           49    0.209 
## 2  1999 f           57    0.244 
## 3  1999 r           11    0.0470
## 4  2008 4           54    0.231 
## 5  2008 f           49    0.209 
## 6  2008 r           14    0.0598
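The message that keeps appearing in the output mentions the .groups argument. As a sketch of how that could look, you can tell summarize() what to do with the groups directly: .groups = "drop_last" is the default peeling-off behavior, and .groups = "drop" removes all groups, like adding ungroup(). The object names by_year and overall are made up for illustration, and the code assumes dplyr and ggplot2 are installed.

```r
library(dplyr)

mpg <- ggplot2::mpg

# Keep the year grouping explicitly, so proportions add to 100% within each year
by_year <- mpg %>%
  group_by(year, drv) %>%
  summarize(num_cars = n(), .groups = "drop_last") %>%
  mutate(prop_cars = num_cars / sum(num_cars))

# Drop all groups, so proportions add to 100% overall
overall <- mpg %>%
  group_by(year, drv) %>%
  summarize(num_cars = n(), .groups = "drop") %>%
  mutate(prop_cars = num_cars / sum(num_cars))

# Check the totals: one total per year, then one overall total
by_year %>%
  summarize(total = sum(prop_cars))
sum(overall$prop_cars)
```

Setting .groups yourself also silences the message, since summarize() no longer has to guess what you want.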