Overview

Reference: https://stackoverflow.com/questions/19258460/standard-error-bars-using-stat-summary

Let’s look at the difference between 2 different ways of supplying functions to stat_summary:

  1. Binding the function (e.g. mean) to the argument fun.y

  2. Binding the function (e.g. mean_se or Hmisc::mean_cl_normal) to the argument fun.data

This could be a good way to fit pointwise confidence intervals without fitting a regression model to the entire dataset.

# ?stat_summary

Plot with 1 s.e. intervals around mean

This plot shows mean and 1 standard error intervals

mtcars %>% 
    ggplot(aes(as.factor(cyl),
               qsec)) + 
    
    # stat_summary with arg "fun.y":
    # A function that returns a single number 
    stat_summary(fun.y = mean, 
                 geom = "point") + 
    
    # stat_summary with arg "fun.data": 
    # A function that is given the complete data and should return a data frame
    # with variables ymin, y, and ymax (for use in plotting ranges).
    
    # mean_se( ) is intended for use with stat_summary. It calculates mean and 
    # standard error 
    stat_summary(fun.data = mean_se,  
                 geom = "errorbar") + 
    
    scale_y_continuous(limits = c(10, 30))

Plot with normmal-based CI around mean

Now use function mean_cl_normal from package {Hmisc}. This gives the sample mean and lower and upper Gaussian confidence limits based on the t-distribution

# ?mean_cl_normal

Here’s how the function works:

smean.cl.normal
## function (x, mult = qt((1 + conf.int)/2, n - 1), conf.int = 0.95, 
##     na.rm = TRUE) 
## {
##     if (na.rm) 
##         x <- x[!is.na(x)]
##     n <- length(x)
##     if (n < 2) 
##         return(c(Mean = mean(x), Lower = NA, Upper = NA))
##     xbar <- sum(x)/n
##     se <- sqrt(sum((x - xbar)^2)/n/(n - 1))
##     c(Mean = xbar, Lower = xbar - mult * se, Upper = xbar + mult * 
##         se)
## }
## <bytecode: 0x000000001e286ef0>
## <environment: namespace:Hmisc>
# plot 
mtcars %>% 
    ggplot(aes(as.factor(cyl),
               qsec)) + 
    
    # stat_summary with arg "fun.y":
    # A function that returns a single number 
    stat_summary(fun.y = mean, 
                 geom = "point") + 
    
    # stat_summary with arg "fun.data": 
    # A function that is given the complete data and should return a data frame
    # with variables ymin, y, and ymax (for use in plotting ranges).
    
    # mean_cl_normal( ) is intended for use with stat_summary. It calculates
    # sample mean and lower and upper Gaussian confidence limits based on the 
    # t-distribution
    stat_summary(fun.data = mean_cl_normal,  
                 geom = "errorbar") +
    
    scale_y_continuous(limits = c(10, 30))

Why not regression?

Consider the following. Is this a better way to show confidence intervals?

Ans. No, not if cylinder is not actually a continuous variable.

These are two different models:

  • if we treat cyl as continuous, then we estimate two parameters: slope and intercept

  • if we treat cyl as factor, then we estimate 3 parameters: intercept, effect of cyl = 6, and effect of cyl = 8

#’ Note that the CI will not be displayed if you pass cyl as a factor (2nd graph below)

# cyl as continuous
mtcars %>% 
    ggplot(aes(x = cyl,
               y = qsec)) + 
    geom_point() + 
    geom_smooth(method = "lm")

# cyl as factor
mtcars %>% 
    ggplot(aes(x = as.factor(cyl),
               y = qsec)) + 
    geom_point() + 
    geom_smooth(method = "lm")

Exploring alternative models

lm(qsec ~ cyl, 
   data = mtcars) %>% 
    tidy %>% 
    kableExtra::kable()
term estimate std.error statistic p.value
(Intercept) 21.5091530 0.9476718 22.696838 0.0000000
cyl -0.5915803 0.1473292 -4.015362 0.0003661
lm(qsec ~ as.factor(cyl), 
   data = mtcars) %>% 
    tidy %>% 
    kableExtra::kable()
term estimate std.error statistic p.value
(Intercept) 19.13727 0.4492501 42.598256 0.0000000
as.factor(cyl)6 -1.16013 0.7204029 -1.610390 0.1181454
as.factor(cyl)8 -2.36513 0.6003358 -3.939678 0.0004712