If you have information about the uncertainty present in your data, whether it be from a model or from distributional assumptions, it’s a good idea to display it. There are four basic families of geoms that can be used for this job, depending on whether the x values are discrete or continuous, and whether or not you want to display the middle of the interval, or just the extent:
- Discrete x, range: geom_errorbar(), geom_linerange()
- Discrete x, range & center: geom_crossbar(), geom_pointrange()
- Continuous x, range: geom_ribbon()
- Continuous x, range & center: geom_smooth(stat = "identity")

These geoms assume that you are interested in the distribution of y conditional on x and use the aesthetics ymin and ymax to determine the range of the y values.
library(ggplot2)
y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))
base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
base + geom_crossbar()
base + geom_pointrange()
base + geom_smooth(stat = "identity")
base + geom_errorbar()
base + geom_linerange()
base + geom_ribbon()
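When the uncertainty comes from a model rather than from a precomputed standard error column, the interval bounds first have to be derived from the fit. The following is a minimal sketch under stated assumptions: the `mtcars` data, the linear model, and the normal-approximation multiplier of 1.96 are illustrative choices, not part of the example above.

```r
library(ggplot2)

# Fit a simple linear model (illustrative; any model with standard errors works)
mod <- lm(mpg ~ wt, data = mtcars)

# Predict on an evenly spaced grid of x values and request standard errors
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
pred <- predict(mod, newdata = grid, se.fit = TRUE)

# Center of the interval, plus an approximate 95% band
# (assumption: normal approximation, fit +/- 1.96 standard errors)
grid$mpg <- pred$fit
grid$lo <- pred$fit - 1.96 * pred$se.fit
grid$hi <- pred$fit + 1.96 * pred$se.fit

# Continuous x, range & center: draw the fitted line with its band
p <- ggplot(grid, aes(wt, mpg, ymin = lo, ymax = hi)) +
  geom_smooth(stat = "identity")
p
```

Swapping `geom_smooth(stat = "identity")` for `geom_ribbon()` would show only the extent of the interval without the center line.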
# Weighted data

When you have aggregated data where each row in the
dataset represents multiple observations, you need some way to take into
account the weighting variable. We will use some data collected on
Midwest states in the 2000 US census in the built-in midwest data frame.
The data consists mainly of percentages (e.g., percent white, percent
below poverty line, percent with college degree) and some information
for each county (area, total population, population density).
# Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
  geom_point()
# Weight by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
  geom_point(aes(size = poptotal / 1e6)) +
  scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))
For more complicated geoms which involve some statistical
transformation, we specify weights with the weight aesthetic. These
weights will be passed on to the statistical summary function. Weights
are supported for every case where it makes sense: smoothers, quantile
regressions, boxplots, histograms, and density plots. You can’t see this
weighting variable directly, and it doesn’t produce a legend, but it
will change the results of the statistical summary. The following code
shows how weighting by total population affects the relationship
between percent white and percent below the poverty line.
# Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
  geom_point() +
  geom_smooth(method = lm, linewidth = 1)
#> `geom_smooth()` using formula = 'y ~ x'
# Weighted by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
  geom_point(aes(size = poptotal / 1e6)) +
  geom_smooth(aes(weight = poptotal), method = lm, linewidth = 1) +
  scale_size_area(guide = "none")
#> `geom_smooth()` using formula = 'y ~ x'
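To see what the weight aesthetic changes numerically, the two fits can be reproduced outside of a plot. This is a sketch that assumes the smoother hands the weights to `lm()` through its `weights` argument; the object names `fit_unweighted` and `fit_weighted` are introduced here for illustration only.

```r
library(ggplot2)  # provides the midwest data frame

# Unweighted fit: every county counts equally
fit_unweighted <- lm(percbelowpoverty ~ percwhite, data = midwest)

# Weighted fit: each county contributes in proportion to its total population
fit_weighted <- lm(percbelowpoverty ~ percwhite, data = midwest,
                   weights = poptotal)

# Compare the slopes of the two regression lines
coef(fit_unweighted)["percwhite"]
coef(fit_weighted)["percwhite"]
```

Comparing the two slopes shows how much the large counties pull the fitted line once they are weighted by population.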
When we weight a histogram or density plot by total population, we change from looking at the distribution of the number of counties, to the distribution of the number of people. The following code shows the difference this makes for a histogram of the percentage below the poverty line:
ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1) +
  ylab("Counties")
ggplot(midwest, aes(percbelowpoverty)) +
  # Divide by 1000 so the bar heights match the axis label
  geom_histogram(aes(weight = poptotal / 1000), binwidth = 1) +
  ylab("Population (1000s)")
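The difference between the two histograms comes down to what is added up in each bin: rows in the unweighted case, weights in the weighted case. The following sketch recomputes both by hand; the `floor()` binning is a rough stand-in for `binwidth = 1` (the exact bin boundaries chosen by ggplot2 may differ), and the variable names are illustrative.

```r
library(ggplot2)  # provides the midwest data frame

# Assign each county to a one-percentage-point bin
# (assumption: approximates binwidth = 1, boundaries may not match ggplot2's)
bins <- floor(midwest$percbelowpoverty)

# Unweighted histogram: count the counties in each bin
counties_per_bin <- table(bins)

# Weighted histogram: sum the population of the counties in each bin
people_per_bin <- tapply(midwest$poptotal, bins, sum)

# Side-by-side view of the first few bins
head(data.frame(
  bin = names(people_per_bin),
  counties = as.vector(counties_per_bin),
  population = as.vector(people_per_bin)
))
```

A bin containing a few very populous counties can be short in the first column and tall in the second, which is exactly the shift the weighted histogram makes visible.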