library(ggplot2)
This is a simple bar plot showing how many diamonds we have for every diamond cut:
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
With geom_bar you are saying “I want a bar plot” and implying “I want to count how many samples I have of each type”. Internally, ggplot separates these two concepts into stat_count and geom_bar: the first counts how many samples your data have of each type, and the second computes the coordinates of the rectangles that give each bar its shape.
This separation of geom_ and stat_ is present in all of ggplot. For instance, geom_histogram is very similar to geom_bar, but uses stat_bin instead, to put samples into bins and then count the number of samples in each bin.
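As a minimal sketch of this pairing, the same layer can be requested either from the geom side or from the stat side. The object names below are made up for illustration; layer_data() is the ggplot2 helper that returns the data computed for one layer, and bins = 30 is an arbitrary choice.

```r
library(ggplot2)

# Bar plot: start from the geom (its default stat is "count") ...
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
# ... or start from the stat (its default geom is "bar").
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut), geom = "bar")

# Histogram: the same idea with stat_bin.
p_hist <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), bins = 30)
p_bin <- ggplot(data = diamonds) +
  stat_bin(mapping = aes(x = carat), bins = 30, geom = "bar")

# Compare the counts computed by each pair of layers.
counts_geom <- layer_data(p_geom, 1)$count
counts_stat <- layer_data(p_stat, 1)$count
identical(counts_geom, counts_stat)
identical(layer_data(p_hist, 1)$count, layer_data(p_bin, 1)$count)
```

Both spellings go through the same stat, so the computed counts should match; which one to write is mostly a matter of which half of the pair you want to emphasise.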
stat_count provides two internal variables, ..count.. and ..prop.., referring to count and proportion respectively. Don’t be surprised by the ..name.. notation: it is used to prevent confusion with your own columns (so don’t give your own columns weird names like ..count..!)
You can create the same plot giving some extra information that ggplot assumes by default:
y = ..count.. tells ggplot to use the count variable from stat_count() as the height of each bar.
stat = "count" tells ggplot to use stat_count (count samples of each type).
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..count..), stat = "count")
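If you are on a recent ggplot2, the same explicit call can be sketched with after_stat(), which was added in ggplot2 3.3.0. To my knowledge the ..name.. notation has been deprecated since 3.4.0 and now triggers a warning. The names p_count and bar_data are illustrative only.

```r
library(ggplot2)

# after_stat(count) refers to the count variable computed by stat_count();
# it plays the same role as the older ..count.. spelling.
p_count <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(count)), stat = "count")

# The bar heights (y) are taken directly from the computed counts.
bar_data <- layer_data(p_count, 1)
all(bar_data$y == bar_data$count)
```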
When working with a custom stat_, it is often useful to inspect the values it computes internally. ggplot allows us to do that with ggplot_build:
plt <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
plt_b <- ggplot_build(plt)
plt_b$data[[1]]
y | count | prop | x | PANEL | group | ymin | ymax | xmin | xmax | colour | fill | size | linetype | alpha |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1610 | 1610 | 1 | 1 | 1 | 1 | 0 | 1610 | 0.55 | 1.45 | NA | grey35 | 0.5 | 1 | NA |
4906 | 4906 | 1 | 2 | 1 | 2 | 0 | 4906 | 1.55 | 2.45 | NA | grey35 | 0.5 | 1 | NA |
12082 | 12082 | 1 | 3 | 1 | 3 | 0 | 12082 | 2.55 | 3.45 | NA | grey35 | 0.5 | 1 | NA |
13791 | 13791 | 1 | 4 | 1 | 4 | 0 | 13791 | 3.55 | 4.45 | NA | grey35 | 0.5 | 1 | NA |
21551 | 21551 | 1 | 5 | 1 | 5 | 0 | 21551 | 4.55 | 5.45 | NA | grey35 | 0.5 | 1 | NA |
Here one can see the count and prop columns. The prop column is computed as count divided by the sum of all the counts that belong to the same group. By default, ggplot creates one group per bar, so all the proportions are equal to 1.
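To make the formula concrete, here is a rough base-R sketch of the same arithmetic. It is not ggplot's own implementation. The vectors group_default and group_single are hypothetical stand-ins for the grouping that ggplot tracks internally, and the single shared group is included only for contrast.

```r
library(ggplot2)  # only needed here for the diamonds dataset

# Number of diamonds per cut, in factor-level order.
counts <- as.numeric(table(diamonds$cut))

# One group per bar: each count is divided by the total of its own group,
# which is the count itself.
group_default <- seq_along(counts)
prop_default <- counts / ave(counts, group_default, FUN = sum)

# One shared group: each count is divided by the grand total.
group_single <- rep(1, length(counts))
prop_single <- counts / ave(counts, group_single, FUN = sum)

print(prop_default)
print(prop_single)
```

With one group per bar every proportion is a count divided by itself, which is why the prop column in the table above is 1 everywhere.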
When we try to transform the counts into percentages we should use ..prop.. as the y variable, but if we don’t provide a group every bar ends up with a height of 1:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..), stat = "count")
However, if we provide a group, stat_count will compute the proportions as we want:
plt <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
plt_b <- ggplot_build(plt)
plt_b$data[[1]]
y | count | prop | x | group | PANEL | ymin | ymax | xmin | xmax | colour | fill | size | linetype | alpha |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.0298480 | 1610 | 0.0298480 | 1 | 1 | 1 | 0 | 0.0298480 | 0.55 | 1.45 | NA | grey35 | 0.5 | 1 | NA |
0.0909529 | 4906 | 0.0909529 | 2 | 1 | 1 | 0 | 0.0909529 | 1.55 | 2.45 | NA | grey35 | 0.5 | 1 | NA |
0.2239896 | 12082 | 0.2239896 | 3 | 1 | 1 | 0 | 0.2239896 | 2.55 | 3.45 | NA | grey35 | 0.5 | 1 | NA |
0.2556730 | 13791 | 0.2556730 | 4 | 1 | 1 | 0 | 0.2556730 | 3.55 | 4.45 | NA | grey35 | 0.5 | 1 | NA |
0.3995365 | 21551 | 0.3995365 | 5 | 1 | 1 | 0 | 0.3995365 | 4.55 | 5.45 | NA | grey35 | 0.5 | 1 | NA |
And the plot will be what we expect:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1), stat = "count")
We can finally tweak the labels so they are expressed as percentages:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1), stat = "count") +
  scale_y_continuous(labels = scales::percent_format())
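As a quick sanity check on the percentage plot, the bar heights can be pulled back out and summed. This is a sketch that assumes ggplot2 3.3.0 or later and a scales version that provides label_percent(), the newer spelling of percent_format(). The names p_percent and bar_heights are illustrative.

```r
library(ggplot2)

# Same plot as above, written with after_stat() and scales::label_percent().
p_percent <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
  scale_y_continuous(labels = scales::label_percent())

# The label formatter only changes how the axis is printed;
# the underlying bar heights are still proportions.
bar_heights <- layer_data(p_percent, 1)$y
sum(bar_heights)
```

Because all bars share one group, the heights should add up to 1 (up to floating-point rounding), which corresponds to 100% on the relabelled axis.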