Reference: https://stackoverflow.com/questions/19258460/standard-error-bars-using-stat-summary
Let’s look at the difference between 2 different ways of supplying functions to stat_summary:
Binding the function (e.g. mean) to the argument fun.y
Binding the function (e.g. mean_se or Hmisc::mean_cl_normal) to the argument fun.data
This could be a good way to fit pointwise confidence intervals without fitting a regression model to the entire dataset.
# ?stat_summary
This plot shows mean and 1 standard error intervals
mtcars %>%
ggplot(aes(as.factor(cyl),
qsec)) +
# stat_summary with arg "fun.y":
# A function that returns a single number
stat_summary(fun.y = mean,
geom = "point") +
# stat_summary with arg "fun.data":
# A function that is given the complete data and should return a data frame
# with variables ymin, y, and ymax (for use in plotting ranges).
# mean_se( ) is intended for use with stat_summary. It calculates mean and
# standard error
stat_summary(fun.data = mean_se,
geom = "errorbar") +
scale_y_continuous(limits = c(10, 30))
Now use function mean_cl_normal from package {Hmisc}. This gives the sample mean and lower and upper Gaussian confidence limits based on the t-distribution
# ?mean_cl_normal
Here’s how the function works:
smean.cl.normal
## function (x, mult = qt((1 + conf.int)/2, n - 1), conf.int = 0.95,
## na.rm = TRUE)
## {
## if (na.rm)
## x <- x[!is.na(x)]
## n <- length(x)
## if (n < 2)
## return(c(Mean = mean(x), Lower = NA, Upper = NA))
## xbar <- sum(x)/n
## se <- sqrt(sum((x - xbar)^2)/n/(n - 1))
## c(Mean = xbar, Lower = xbar - mult * se, Upper = xbar + mult *
## se)
## }
## <bytecode: 0x000000001e286ef0>
## <environment: namespace:Hmisc>
# plot
mtcars %>%
ggplot(aes(as.factor(cyl),
qsec)) +
# stat_summary with arg "fun.y":
# A function that returns a single number
stat_summary(fun.y = mean,
geom = "point") +
# stat_summary with arg "fun.data":
# A function that is given the complete data and should return a data frame
# with variables ymin, y, and ymax (for use in plotting ranges).
# mean_cl_normal( ) is intended for use with stat_summary. It calculates
# sample mean and lower and upper Gaussian confidence limits based on the
# t-distribution
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar") +
scale_y_continuous(limits = c(10, 30))
Consider the following. Is this a better way to show confidence intervals?
Ans. No, not if cylinder is not actually a continuous variable.
These are two different models:
if we treat cyl as continuous, then we estimate two parameters: slope and intercept
if we treat cyl as factor, then we estimate 3 parameters: intercept, effect of cyl = 6, and effect of cyl = 8
#’ Note that the CI will not be displayed if you pass cyl as a factor (2nd graph below)
# cyl as continuous
mtcars %>%
ggplot(aes(x = cyl,
y = qsec)) +
geom_point() +
geom_smooth(method = "lm")
# cyl as factor
mtcars %>%
ggplot(aes(x = as.factor(cyl),
y = qsec)) +
geom_point() +
geom_smooth(method = "lm")
lm(qsec ~ cyl,
data = mtcars) %>%
tidy %>%
kableExtra::kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 21.5091530 | 0.9476718 | 22.696838 | 0.0000000 |
| cyl | -0.5915803 | 0.1473292 | -4.015362 | 0.0003661 |
lm(qsec ~ as.factor(cyl),
data = mtcars) %>%
tidy %>%
kableExtra::kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 19.13727 | 0.4492501 | 42.598256 | 0.0000000 |
| as.factor(cyl)6 | -1.16013 | 0.7204029 | -1.610390 | 0.1181454 |
| as.factor(cyl)8 | -2.36513 | 0.6003358 | -3.939678 | 0.0004712 |