Use coord_cartesian to zoom, not xlim, ylim, or scale_*!
Why? Because the xlim, ylim, and scale_* commands remove data points, whereas the coord_cartesian command simply zooms the plot. This causes trouble when ggplot2 is using one of the stat_* functions to compute something (such as a smoothed fit to data, a density, or a contour) from the underlying data before plotting.
This is all explained in the ggplot2 docs and book, and so in theory everyone should know it. But in practice I’ve used ggplot2 for years and somehow managed to either not read or ignore that part of the documentation, and so never realized this. I asked some colleagues and they hadn’t either, which is why I thought it was worth the time to write up a simple example of how much trouble this can cause if you’re not careful.
Here’s an example where we’ll generate 500 random points with a (weak) linear relationship between them and use geom_smooth to fit a linear function to them.
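library(ggplot2)

# 500 noisy points whose true slope is 2
set.seed(42)
x <- rnorm(500)
y <- 2*x + 25*rnorm(500)
df <- data.frame(x, y)

# Fit and draw a linear model through all of the data
ggplot(df, aes(x=x, y=y)) +
  geom_smooth(method="lm")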
This looks pretty good: the fitted line shows a positive correlation between x and y and has just about the right slope (of approximately 2).
Let’s try something seemingly harmless, and just center the y-axis around 0 from -12 to 12 using the ylim function.
ggplot(df, aes(x=x, y=y)) +
  geom_smooth(method="lm") +
  ylim(c(-12,12))
## Warning: Removed 326 rows containing non-finite values (stat_smooth).
What’s going on here? Why did setting the y-axis limits on the plot change the slope of the fitted line and reverse the direction of the apparent correlation between x and y?
This happens because of the order of operations involved in making the plot. When you call a command like geom_smooth in combination with ylim, ylim first filters the data to the specified range, then stat_smooth is called behind the scenes to fit the line, and, finally, the plot is displayed. So the slope of the fitted line changes because it’s fit to different data!
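To see the effect of that ordering without any plotting, here is a minimal sketch in base R that mimics the filtering step by hand and fits the line both ways. The names df_clipped, fit_all, and fit_clipped are just illustrative, and subsetting rows is only an approximation of what ylim does internally (as far as I know, it marks out-of-range values as missing and the stat then drops them).

```r
# Recreate the data from the example above.
set.seed(42)
x <- rnorm(500)
y <- 2*x + 25*rnorm(500)
df <- data.frame(x, y)

# Fit on all 500 points: what the smoother sees when no limits are set.
fit_all <- lm(y ~ x, data = df)

# Mimic the filtering step: keep only rows with y inside [-12, 12].
df_clipped <- subset(df, y >= -12 & y <= 12)

# Fit on the surviving points: what the smoother sees after ylim.
fit_clipped <- lm(y ~ x, data = df_clipped)

# How many rows were dropped, and how do the two slopes compare?
nrow(df) - nrow(df_clipped)
coef(fit_all)[["x"]]
coef(fit_clipped)[["x"]]
```

The two slopes come from different subsets of the same data, which is exactly the discrepancy between the two plots.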
Now, to be fair, stat_smooth warns you about removing these values, but it’s very easy to overlook this warning and make some dangerous plotting errors if you’re not careful!
The same thing happens when you use xlim, scale_x_continuous, etc.
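As a quick check of that claim, the sketch below builds the same plot once with ylim and once with scale_y_continuous(limits = ...) and compares the fitted values ggplot2 computes for each. It assumes that ggplot_build(p)$data[[1]] holds the computed data for the first layer, which is my understanding of current ggplot2 versions; the variable names are illustrative.

```r
library(ggplot2)

# Same data as in the example above.
set.seed(42)
x <- rnorm(500)
y <- 2*x + 25*rnorm(500)
df <- data.frame(x, y)

base_plot <- ggplot(df, aes(x = x, y = y)) +
  geom_smooth(method = "lm")

# Two ways of setting limits on the y scale.
p_ylim  <- base_plot + ylim(c(-12, 12))
p_scale <- base_plot + scale_y_continuous(limits = c(-12, 12))

# Pull out the fitted line each plot would draw. Both builds emit the
# "removed rows" warning, silenced here to keep the output readable.
built_ylim  <- suppressMessages(suppressWarnings(ggplot_build(p_ylim)$data[[1]]))
built_scale <- suppressMessages(suppressWarnings(ggplot_build(p_scale)$data[[1]]))

# TRUE if both approaches produce the same fitted values.
same_fit <- isTRUE(all.equal(built_ylim$y, built_scale$y))
same_fit
```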
Here’s the right way to do things using coord_cartesian, which doesn’t eliminate any data points but simply zooms the plot to the desired region.
ggplot(df, aes(x=x, y=y)) +
  geom_smooth(method="lm") +
  coord_cartesian(ylim=c(-12,12))
This plot has the right slope, and so all is well again.
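If you’d rather confirm this numerically than by eye, here is a rough sketch that recovers the slope of the line each plot actually draws. The helper drawn_slope is a hypothetical name of my own, and it assumes ggplot_build(p)$data[[1]] contains the fitted x and y values for the smooth layer.

```r
library(ggplot2)

# Same data as in the example above.
set.seed(42)
x <- rnorm(500)
y <- 2*x + 25*rnorm(500)
df <- data.frame(x, y)

base_plot <- ggplot(df, aes(x = x, y = y)) +
  geom_smooth(method = "lm")

# Hypothetical helper: the drawn line is straight, so regressing its
# y values on its x values recovers the slope of the fit.
drawn_slope <- function(p) {
  smooth <- suppressMessages(suppressWarnings(ggplot_build(p)$data[[1]]))
  coef(lm(y ~ x, data = smooth))[["x"]]
}

# Slope drawn when the limits are set on the scale (data removed first).
slope_clipped <- drawn_slope(base_plot + ylim(c(-12, 12)))

# Slope drawn when we only zoom the coordinate system.
slope_zoomed <- drawn_slope(base_plot + coord_cartesian(ylim = c(-12, 12)))

# Reference slope from fitting all of the data directly.
slope_full <- coef(lm(y ~ x, data = df))[["x"]]

c(clipped = slope_clipped, zoomed = slope_zoomed, full = slope_full)
```

The zoomed slope should agree with the full-data slope, while the clipped one should not.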
The lesson again, in case you missed it: Use coord_cartesian to zoom, not xlim, ylim, or scale_*!