library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
Issues with geom_col
There appears to be some inconsistency with the data shown in geom_col
based on where it is published and the size of the window/viewport.
Read data
df <- read_csv("data/geom_col_data.csv",
               col_select = c("dtime", "var"))
Distribution
The example data set is a time series (15-minute interval) with a variable that is mostly 0, with small events in the range (0, 1). This data is normally plotted as 72-hour and 90-day series. I think the distribution is likely causing the issue, but the change in output is still concerning.
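A quick numeric summary can back up what the histogram shows. The sketch below is only an illustration: `toy` is a hypothetical stand-in for the real `df`, with the same column names and a maximum of 0.73 as reported in this document. It computes the share of exact zeros and the largest single observation.

```r
library(dplyr)

# Hypothetical stand-in for the real data: mostly zeros with a few small events.
toy <- data.frame(
  dtime = as.POSIXct("2025-05-16 00:00:00", tz = "UTC") + 900 * 0:7,
  var   = c(0, 0, 0.2, 0, 0, 0.73, 0, 0)
)

var_summary <- toy |>
  summarise(
    n          = n(),
    share_zero = mean(var == 0),  # fraction of rows that are exactly 0
    max_var    = max(var)         # largest single observation
  )

var_summary
```

Running the same `summarise()` call on the real `df` would give the reference value that no single bar should exceed.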
ggplot(df, aes(x = var)) +
  geom_histogram(bins = 50) +
  labs(
    title = "Histogram of var",
    x = "var",
    y = "Count"
  )
Plots with geom_col
When I create a basic column plot using geom_col, the height of a column and the scale seem to change independently as the window size changes.
basic_col <- ggplot(df, aes(x = dtime, y = var)) +
  geom_col() +
  labs(
    title = "geom_col Plot with Inaccurate Column Heights"
  )
basic_col
The largest value for var is 0.73, but the output above shows it as greater than 0.75. The screenshot below is from clicking the “Show in New Window” button, which matches the image in RStudio but differs from what is rendered in the Quarto HTML document.
If I resize the plot in the window to be longer, the largest value now appears to be almost 1.5.
I’m guessing this may have to do with position = "stack", but the behavior seems inconsistent.
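One way to test the stacking guess: geom_col uses position = "stack" by default, so rows that share the same x value are drawn on top of each other and the bar reaches their sum. The sketch below looks for repeated timestamps. The `toy` data frame is hypothetical and deliberately contains one duplicate; whether the real `df` contains any is not known here.

```r
library(dplyr)

# Hypothetical data: the second timestamp appears twice.
t0 <- as.POSIXct("2025-05-16 00:00:00", tz = "UTC")
toy <- data.frame(
  dtime = t0 + 900 * c(0, 1, 1, 2),
  var   = c(0, 0.7, 0.5, 0.1)
)

# Timestamps that occur more than once, with the total a stacked bar would reach.
dupes <- toy |>
  group_by(dtime) |>
  summarise(n = n(), stacked_height = sum(var), .groups = "drop") |>
  filter(n > 1)

dupes
```

If the same pipeline returns zero rows for the real `df`, exact duplicates are ruled out and the cause is more likely on the rendering side.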
Saving basic plot
ggsave("basic_col.png", basic_col)
Saving 7 x 5 in image
The saved plot is the same as the default in RStudio.
Adding labels
If we add labels, we can clearly see that the columns extend past the actual values. The output shown below in the rendered Quarto doc differs from what is shown in RStudio, as it shows the spike much closer to the beginning of May.
basic_col + geom_text(label = df$var)
In case it was stacking values due to the scale of the x-axis, I tried reducing the date range to see if it would improve the results. It did not:
narrow_col <- ggplot(df |> dplyr::filter(between(dtime,
                                                 ymd_hms("2025-05-16 00:00:00"),
                                                 ymd_hms("2025-05-17 00:00:00"))),
                     aes(x = dtime, y = var)) +
  geom_col() +
  scale_x_datetime(date_breaks = "15 min",
                   #date_minor_breaks = "15 min",
                   date_labels = "%R") +
  labs(
    title = "Narrow geom_col Plot with Inaccurate Column Heights",
    x = "Time (5/16/25)"
  )
narrow_col
Comparing with geom_point
geom_point appears to show the proper heights:
basic_point <- ggplot(df, aes(x = dtime, y = var)) +
  geom_point() +
  labs(
    title = "geom_point Plot with Accurate Column Heights"
  )
basic_point
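geom_point draws every row at its own value and never stacks, which may be why its heights look right. A comparable check for columns, sketched below on hypothetical `toy` data with one repeated timestamp, is to draw geom_col with position = "identity" and compare the computed bar tops with the default. `layer_data()` returns the values ggplot2 actually computes for a layer, independent of window size.

```r
library(ggplot2)

# Hypothetical data: the second timestamp appears twice.
t0 <- as.POSIXct("2025-05-16 00:00:00", tz = "UTC")
toy <- data.frame(
  dtime = t0 + 900 * c(0, 1, 1, 2),
  var   = c(0, 0.7, 0.5, 0.1)
)

stacked   <- ggplot(toy, aes(x = dtime, y = var)) + geom_col()
unstacked <- ggplot(toy, aes(x = dtime, y = var)) +
  geom_col(position = "identity")

# Tallest computed bar top under each position adjustment.
max_stacked   <- max(layer_data(stacked)$ymax)
max_unstacked <- max(layer_data(unstacked)$ymax)

c(stacked = max_stacked, unstacked = max_unstacked)
```

With position = "identity", overlapping bars are drawn over one another rather than summed, so the tallest bar matches the largest single value.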
Adjusting scales
Using scale_y_continuous does not seem to fix the issue:
scale_col <- ggplot(df, aes(x = dtime, y = var)) +
  geom_col() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(
    title = "geom_col Plot with scale_y_continuous"
  )
scale_col
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_col()`).
It still thinks there are values greater than 1 in the data set.
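The warning can come from NA values or from bars whose computed top falls outside the limits. To see which bars are involved, one option is to inspect the computed layer data of the plot without limits and keep the rows whose ymax exceeds 1. This is a sketch on hypothetical `toy` data, not on the real `df`, and it assumes the x scale is a datetime scale in UTC so that x can be converted back from seconds.

```r
library(ggplot2)

# Hypothetical data: the second timestamp appears twice.
t0 <- as.POSIXct("2025-05-16 00:00:00", tz = "UTC")
toy <- data.frame(
  dtime = t0 + 900 * c(0, 1, 1, 2),
  var   = c(0, 0.7, 0.5, 0.1)
)

p <- ggplot(toy, aes(x = dtime, y = var)) + geom_col()

# Computed bar geometry; keep only bars whose top exceeds the intended limit.
bars <- layer_data(p)
over_limit <- bars[bars$ymax > 1, c("x", "ymin", "ymax")]

# x is stored as seconds since the epoch on a datetime scale.
over_limit$dtime <- as.POSIXct(over_limit$x, origin = "1970-01-01", tz = "UTC")

over_limit
```

Applied to the real plot, the `dtime` column of `over_limit` would point to the timestamps worth inspecting in `df`.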