Overview

Reference: https://stackoverflow.com/questions/19258460/standard-error-bars-using-stat-summary

Let’s look at the difference between 2 different ways of supplying functions to stat_summary:

Binding the function (e.g. mean) to the argument fun.y
Binding the function (e.g. mean_se or Hmisc::mean_cl_normal) to the argument fun.data

This could be a good way to fit pointwise confidence intervals without fitting a regression model to the entire dataset.

# ?stat_summary

Plot with 1 s.e. intervals around mean

This plot shows mean and 1 standard error intervals

mtcars %>% 
    ggplot(aes(as.factor(cyl),
               qsec)) + 
    
    # stat_summary with arg "fun.y":
    # A function that returns a single number 
    stat_summary(fun.y = mean, 
                 geom = "point") + 
    
    # stat_summary with arg "fun.data": 
    # A function that is given the complete data and should return a data frame
    # with variables ymin, y, and ymax (for use in plotting ranges).
    
    # mean_se( ) is intended for use with stat_summary. It calculates mean and 
    # standard error 
    stat_summary(fun.data = mean_se,  
                 geom = "errorbar") + 
    
    scale_y_continuous(limits = c(10, 30))

Plot with normmal-based CI around mean

Now use function mean_cl_normal from package {Hmisc}. This gives the sample mean and lower and upper Gaussian confidence limits based on the t-distribution

# ?mean_cl_normal

Here’s how the function works:

smean.cl.normal

## function (x, mult = qt((1 + conf.int)/2, n - 1), conf.int = 0.95, 
##     na.rm = TRUE) 
## {
##     if (na.rm) 
##         x <- x[!is.na(x)]
##     n <- length(x)
##     if (n < 2) 
##         return(c(Mean = mean(x), Lower = NA, Upper = NA))
##     xbar <- sum(x)/n
##     se <- sqrt(sum((x - xbar)^2)/n/(n - 1))
##     c(Mean = xbar, Lower = xbar - mult * se, Upper = xbar + mult * 
##         se)
## }
## <bytecode: 0x000000001e286ef0>
## <environment: namespace:Hmisc>

# plot 
mtcars %>% 
    ggplot(aes(as.factor(cyl),
               qsec)) + 
    
    # stat_summary with arg "fun.y":
    # A function that returns a single number 
    stat_summary(fun.y = mean, 
                 geom = "point") + 
    
    # stat_summary with arg "fun.data": 
    # A function that is given the complete data and should return a data frame
    # with variables ymin, y, and ymax (for use in plotting ranges).
    
    # mean_cl_normal( ) is intended for use with stat_summary. It calculates
    # sample mean and lower and upper Gaussian confidence limits based on the 
    # t-distribution
    stat_summary(fun.data = mean_cl_normal,  
                 geom = "errorbar") +
    
    scale_y_continuous(limits = c(10, 30))

Why not regression?

Consider the following. Is this a better way to show confidence intervals?

Ans. No, not if cylinder is not actually a continuous variable.

These are two different models:

if we treat cyl as continuous, then we estimate two parameters: slope and intercept
if we treat cyl as factor, then we estimate 3 parameters: intercept, effect of cyl = 6, and effect of cyl = 8

#’ Note that the CI will not be displayed if you pass cyl as a factor (2nd graph below)

# cyl as continuous
mtcars %>% 
    ggplot(aes(x = cyl,
               y = qsec)) + 
    geom_point() + 
    geom_smooth(method = "lm")

# cyl as factor
mtcars %>% 
    ggplot(aes(x = as.factor(cyl),
               y = qsec)) + 
    geom_point() + 
    geom_smooth(method = "lm")

Exploring alternative models

lm(qsec ~ cyl, 
   data = mtcars) %>% 
    tidy %>% 
    kableExtra::kable()

term	estimate	std.error	statistic	p.value
(Intercept)	21.5091530	0.9476718	22.696838	0.0000000
cyl	-0.5915803	0.1473292	-4.015362	0.0003661

lm(qsec ~ as.factor(cyl), 
   data = mtcars) %>% 
    tidy %>% 
    kableExtra::kable()

term	estimate	std.error	statistic	p.value
(Intercept)	19.13727	0.4492501	42.598256	0.0000000
as.factor(cyl)6	-1.16013	0.7204029	-1.610390	0.1181454
as.factor(cyl)8	-2.36513	0.6003358	-3.939678	0.0004712

stat_summary functions in ggplot

Nayef Ahmad

2019-11-03

Overview

Plot with 1 s.e. intervals around mean

Plot with normmal-based CI around mean

Why not regression?

Exploring alternative models