library(ggplot2)
library(ggtrace)
library(palmerpenguins)
library(patchwork)
Suppose a facetted layout with 3 panels, a discrete x scale and continuous y scale. The distribution of x-categories are uneven (“sparse”) across panels.
facetted_sparseX_plot <- ggplot(penguins, aes(species)) +
facet_grid(~ island, scales = "free_x", space = "free")
facetted_sparseX_plot
And suppose you want to plot the proportion of each penguin species
(x) within each island (facet). Using after_stat(prop) just
gets you a constant prop = 1 but you can achieve this
within-facet normalization with an additional
group = 1:
(facetted_sparseX_plot +
geom_bar(aes(y = after_stat(prop)))) +
(facetted_sparseX_plot +
geom_bar(aes(y = after_stat(prop), group = 1)))
Problem is that you can’t additionally color code by species using
the fill aesthetic - it gives a cryptic warning:
problem_plot <- facetted_sparseX_plot +
geom_bar(aes(y = after_stat(prop), group = 1, fill = species))
problem_plot
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Note that a minimal reprex doesn’t need the faceting part, but I’m using that more complex example for demo purposes:
ggplot(penguins, aes(x = species)) +
geom_bar(aes(y = after_stat(prop), group = 1, fill = species))
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
There are lots of moving parts to this, so I’m walking through my thought process step by step.
The first thing to consider is of course whether it’s intended to use
something like
aes(y = after_stat(prop), fill = species, group = 1) for
geom_bar().
ggplot docs for geom_bar/stat_count have no
example of after_stat(prop) but there is a clear parallel
in the docs for (the confusingly named) geom_count, where
the docs use the conjunction of after_stat(prop) and
group = 1 to achieve the effect of normalizing within facet
(vs. group)
d <- ggplot(diamonds, aes(x = cut, y = clarity))
geom_count_ex1 <- d +
geom_count(aes(size = after_stat(prop)))
geom_count_ex2 <- d +
geom_count(aes(size = after_stat(prop), group = 1)) +
scale_size_area(max_size = 10)
geom_count_ex1 + geom_count_ex2
Modifying this geom_count() code with something like
aes(fill = <x-var>) is not documented but works like
as one would expect:
d +
geom_count(aes(size = after_stat(prop), group = 1, fill = cut), shape = 21) +
scale_size_area(max_size = 10)
So it seems like people should be able to write code like
geom_bar() code with
aes(group = 1, fill = <some-discrete-var>. If not,
then we need to be careful (e.g., write better error msg) about the fact
that people will pick up this pattern from seeing how
geom_count() works, for example.
Assuming this is indeed a bug, here’s an investigation of why:
This was my first guess given the warning message.
But curiously, you get this warning even when you swap out
fill = with label =. We shouldn’t see a
grouping-related warning since the label aesthetic is
ignored as a possible grouping variable at the
ggplot2:::add_group level, but we see the same warning
again:
facetted_sparseX_plot +
aes(y = after_stat(prop), group = 1) +
geom_bar() +
stat_count(geom = "label", aes(label = species))
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 4 rows containing missing values (`geom_label()`).
So even if this isn’t a bug, it’s at best an incorrect/uninformative
warning because you’re hard-coding the grouping structure with
aes(group = 1), and the plot does correctly
reflect this fact (it just gets the non-grouping aesthetic wrong for
some reason). Furthermore, the fill aesthetic actually
does not get dropped contrary to what the warning suggests -
fill persists in the layer_data where you just get a
partial fill scale for the third panel:
layer_data(problem_plot)
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## fill y count prop x flipped_aes group PANEL ymin ymax
## 1 grey50 0.2619048 44 0.2619048 1 FALSE 1 1 0 0.2619048
## 2 grey50 0.7380952 124 0.7380952 2 FALSE 1 1 0 0.7380952
## 3 grey50 0.4516129 56 0.4516129 1 FALSE 1 2 0 0.4516129
## 4 grey50 0.5483871 68 0.5483871 2 FALSE 1 2 0 0.5483871
## 5 #F8766D 1.0000000 52 1.0000000 1 FALSE 1 3 0 1.0000000
## xmin xmax colour linewidth linetype alpha
## 1 0.55 1.45 NA 0.5 1 NA
## 2 1.55 2.45 NA 0.5 1 NA
## 3 0.55 1.45 NA 0.5 1 NA
## 4 1.55 2.45 NA 0.5 1 NA
## 5 0.55 1.45 NA 0.5 1 NA
Unfortunately, it comes down to Stat$compute_panel and
since parent ggproto methods have a big inertia to being edited, idk if
it’s worth “fixing” at all. In any case, the problem boils down to how
“non_constant_columns” are handled:
(warning suppressed from this point on)
Consider the data for the first panel at the stat compute step. What we want is for the panel-level stat data to look something like this, where the fill column retains the “x-to-fill” correspondence:
count prop x ... group fill PANEL
44 ... 1 ... 1 Adelie 1
124 ... 2 ... 1 Gentoo 1
But here’s what happens instead inside $compute_panel.
First it sends off the data to $compute_group for the
split-apply-combine, and then it identifies whether there are “non
constant columns” from the combined data. These happen in two assignment
calls to the internal stats variable.
ggbody(Stat$compute_panel)[c(4,6)]
## [[1]]
## stats <- lapply(groups, function(group) {
## self$compute_group(data = group, scales = scales, ...)
## })
##
## [[2]]
## stats <- mapply(function(new, old) {
## if (empty(new))
## return(data_frame0())
## old <- old[, !(names(old) %in% names(new)), drop = FALSE]
## non_constant <- vapply(old, function(x) length(unique0(x)) >
## 1, logical(1L))
## non_constant_columns <<- c(non_constant_columns, names(old)[non_constant])
## vec_cbind(new, old[rep(1, nrow(new)), , drop = FALSE])
## }, stats, groups, SIMPLIFY = FALSE)
In the case of the first panel of problem_plot, the
group-level split-apply-combine works fine and calculates the necessary
stat variables, but then in the clean-up stage, fill is
recognized as a non-constant variable (and forced to repeat just the
first value, “Adelie”):
ggtrace_inspect_vars(problem_plot, Stat$compute_panel, vars = "stats")
## $Step5
## $Step5$`1`
## count prop x width flipped_aes
## 1 44 0.2619048 1 0.9 FALSE
## 2 124 0.7380952 2 0.9 FALSE
##
##
## $Step7
## $Step7$`1`
## count prop x width flipped_aes group fill PANEL
## 1 44 0.2619048 1 0.9 FALSE 1 Adelie 1
## 1.1 124 0.7380952 2 0.9 FALSE 1 Adelie 1
Once fill is flagged as a non constant column, it gets
dropped from the data right before the data is returned from
$compute_panel:
ggbody(Stat$compute_panel)[[length(ggbody(Stat$compute_panel))]]
## data_new[, !names(data_new) %in% non_constant_columns, drop = FALSE]
In contrast, data for the third panel works because there’s only 1
value, so the fill column cannot be “non constant” by
definition. This is why the fill column is dropped by
$compute_panel for the first panel (and the second panel
too) but retained in the last:
ggtrace_inspect_return(problem_plot, Stat$compute_panel, cond = 1)
## count prop x width flipped_aes group PANEL
## 1 44 0.2619048 1 0.9 FALSE 1 1
## 1.1 124 0.7380952 2 0.9 FALSE 1 1
ggtrace_inspect_return(problem_plot, Stat$compute_panel, cond = 3)
## count prop x width flipped_aes group fill PANEL
## 1 52 1 1 0.9 FALSE 1 Adelie 3
Then when Stat$compute_layer combines the panel-level
data, you get a bunch of NA values in fill,
which causes the visual bug down the line.
One way is to give StatCount it’s own
$compute_panel method. This is actually precisely why this
behavior doesn’t replicate for geom_count(): its default
StatSum actually extends $compute_panel
instead of $compute_group to handle this more
graciously:
StatSum$compute_panel
## <ggproto method>
## <Wrapper function>
## function (...)
## compute_panel(...)
##
## <Inner function (f)>
## function (data, scales)
## {
## if (is.null(data$weight))
## data$weight <- 1
## group_by <- setdiff(intersect(names(data), ggplot_global$all_aesthetics),
## "weight")
## counts <- count(data, group_by, wt_var = "weight")
## counts <- rename(counts, c(freq = "n"))
## counts$prop <- stats::ave(counts$n, counts$group, FUN = prop.table)
## counts
## }
So a compromise to editing Stat$compute_panel could be
that for StatCount, count is calculated at the
level of $compute_group but prop is calculated
at the level of $compute_panel. This design is actually not
new - StatYdensity extends both compute methods to
“calculate” by group and “normalize” by panel.
But this still needs to happen earlier than the “non constant column”
check. One option is to remove that check entirely, although unlike
StatSum, StatCount does drop an
aesthetic ("weight").
Here’s one attempt that minimizes changes in the code. Specifically,
I edit just the mapply() portion of the
Stat$compute_panel for a new
StatCount2$compute_panel:
Stat$compute_panel’s mapply() line:
stats <- mapply(function(new, old) {
if (empty(new)) return(data_frame0())
old <- old[, !(names(old) %in% names(new)), drop = FALSE]
non_constant <- vapply(old, function(x) length(unique0(x)) > 1, logical(1L))
non_constant_columns <<- c(non_constant_columns, names(old)[non_constant])
vec_cbind(
new,
old[rep(1, nrow(new)), , drop = FALSE]
)
}, stats, groups, SIMPLIFY = FALSE)
StatCount2$compute_panel’s mapply() line:
I have no idea about how you’re supposed to do low-level things with vctrs and things - probably needs refactoring but this is functional.
stats <- mapply(function(new, old) {
if (empty(new)) return(data_frame0())
# Edit 1: Identify exceptions where a variable is constant *within values of x*
within_x_constant <- vapply(old, function(x) nlevels(interaction(x, old$x, drop = TRUE)) == max(old$x), logical(1L))
# Edit 2: Columns to keep
kept <- unique(old[, within_x_constant])
# Edit 3: Don't test non-constant-ness of `kept` columns
old <- old[, !(names(old) %in% c(names(new), names(kept))), drop = FALSE]
non_constant <- vapply(old, function(x) length(unique0(x)) > 1, logical(1L))
non_constant_columns <<- c(non_constant_columns, names(old)[non_constant])
# Edit 4: Add `kept` columns
vec_cbind(
new,
kept[, setdiff(names(kept), "x"), drop = FALSE],
old[rep(1, nrow(new)), setdiff(names(old), names(kept)), drop = FALSE],
)
}, stats, groups, SIMPLIFY = FALSE)
StatCount2 extension:
StatCount2 <- ggproto(
"StatCount2", StatCount,
compute_panel = function (self, data, scales, ...) {
if (ggplot2:::empty(data))
return(ggplot2:::data_frame0())
groups <- split(data, data$group)
stats <- lapply(groups, function(group) {
self$compute_group(data = group, scales = scales, ...)
})
non_constant_columns <- character(0)
stats <- mapply(function(new, old) {
if (ggplot2:::empty(new)) return(ggplot2:::data_frame0())
# Edit 1: Identify exceptions where a variable is constant *within values of x*
within_x_constant <- vapply(old, function(x) nlevels(interaction(x, old$x, drop = TRUE)) == max(old$x), logical(1L))
# Edit 2: Columns to keep
kept <- unique(old[, within_x_constant])
# Edit 3: Don't test non-constant-ness of `kept` columns
old <- old[, !(names(old) %in% c(names(new), names(kept))), drop = FALSE]
non_constant <- vapply(old, function(x) length(unique0(x)) > 1, logical(1L))
non_constant_columns <<- c(non_constant_columns, names(old)[non_constant])
# Edit 4: Add `kept` columns
vctrs::vec_cbind(
new,
kept[, setdiff(names(kept), "x"), drop = FALSE],
old[rep(1, nrow(new)), setdiff(names(old), names(kept)), drop = FALSE],
)
}, stats, groups, SIMPLIFY = FALSE)
non_constant_columns <- ggplot2:::unique0(non_constant_columns)
dropped <- non_constant_columns[!non_constant_columns %in% self$dropped_aes]
if (length(dropped) > 0) {
cli::cli_warn(c(
"The following aesthetics were dropped during statistical transformation: {.field {glue_collapse(dropped, sep = ', ')}}",
"i" = "This can happen when ggplot fails to infer the correct grouping structure in the data.",
"i" = "Did you forget to specify a {.code group} aesthetic or to convert a numerical variable into a factor?"
))
}
# Finally, combine the results and drop columns that are not constant.
data_new <- ggplot2:::vec_rbind0(!!!stats)
data_new[, !names(data_new) %in% non_constant_columns, drop = FALSE]
}
)
Check that StatCount2 behaves as it should with a new
$compute_panel:
facetted_sparseX_plot +
geom_bar(
aes(y = after_stat(prop), group = 1, fill = species),
stat = "count2"
)