Confidence Interval

December 7, 2015

####################################################################################################

There is a lot of debate about statistics based on the frequentist approach (i.e., the p-value).

Many critics show a poor understanding of how the frequentist inference works.

Such a poor understanding is a reflection of poor teaching on the topic.

One of the concepts that is often misunderstood is the Confidence Interval.

In the site of the GraphPad Statistics software there is this definition: “A 95% confidence interval is a range of values that you can be 95% certain contains the true mean of the population”. http://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_more_about_confidence_interval.htm

This is not correct.

Another, operational definition is provided by Good and Hardin, in their excellent “Common Errors in Statistics (and How to Avoid Them)”, Wiley & Sons, Inc., Hoboken, New Jersey (2003), p. 101: “If we derive a large number of 95% confidence intervals, we can expect the true value of the parameter to be included in the computed intervals 95% of the time. (That is, the true values will be included if the assumptions on which the tests and confidence intervals are based are satisfied 100% of the time)”.

Essentially, if you conduct separate data analyses of repeated, independent experiments on samples drawn from the same population (as it is expected within the frequentist approach), you’ll find that the confidence interval (CI) which is calculated from each experiment is expected to include the population parameter around 95% of times (if you are calculating a 95% CI).

However, if a confidence interval will include the population parameter 95% of time in repeated testing, a single confidence interval has no probability attached to it. If we estimate a 95%CI from a single sample, we may be unfortunate enough to have caught the sample which is biased by random sampling error, and the derived CI is among the 5% that does not contain the population parameter.

Indeed, it is very important to realize that “In interpreting a confidence interval based on a test of significance, […] the center of the interval is no more likely than any other value, and the confidence to be placed in the interval is no greater than the confidence we have in the experimental design and statistical test it is based upon”. (Good & Hardin, cit., 2013).

As a matter of fact, “Frequentist CI theory says nothing at all about the probability that a particular, observed confidence interval contains the true value; it is either 0 (if the interval does not contain the parameter) or 1 (if the interval does contain the true value)”. (Morey, R., Hoekstra, R., Rouder, J., Lee, M., & Wagenmakers, E.-J. (2015). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 121. http://doi.org/10.3758/s13423-015-0947-8 See: https://learnbayes.org/papers/confidenceIntervalsFallacy/

The simplest method to calculate the confidence interval for a continuous variable is to use the standard error (SE).

The standard error is a measures of the amount of variability in the sample mean.

It indicates how closely the sample mean can be used to estimate the population mean.

SE (mean) = Standard deviation / sqrt(n), where n is the sample size.

In this case, the calculation of confidence interval for a mean = (mean + (1.96 x SE)) to (mean (1.96 x SE))

Th 1.96 is the 2-sides 5% point of the standard normal distribution.

When you have to calculate the CI of a proportion you can use this formula:

SE (proportion) = sqrt [(p (1 - p)) / n], where p is the proportion.

In this case, the calculation of confidence interval for a proportion = (proportion + (1.96 x SE)) to (proportion - (1.96 x SE))

However, there are different methods to calculate the CI for a proportion. For details, see: http://epitools.ausvet.com.au/content.php?page=CIProportion

There are also different parametric and non-parametric methods to calculate the CI for a mean.

In this exercise I will use a method based on the t-distribution.

The code below can be used to derive a confidence interval for a mean of a sample.

####################################################################################################################################
###
### Calculation of 95% confidence interval for a mean
###
### We will have 100 simulation for a sample with 100 cases from a population with mean = 25 and standatd deviation = 5
###
####################################################################################################################################


### set seed for reproducibility

set.seed(345)

### creation of an empty dataframe 

data <- data.frame()

####################################################################################################################################
### set the values for the simulation on 100 cases
####################################################################################################################################

N <- 100    ### number of simulations
n.a <- 100      ### sample size in group 
m.a <- 25       ### mean in the population from which is sampled the group 
sd.a <- 5       ### standard deviation in the population from which is sampled the group 



####################################################################################################################################
### function for simulation
####################################################################################################################################

for (i in 1:N) { 

    ### creation of random variable from a population with mean = m.a and sd = sd.a
    x <- rnorm(n.a, m.a, sd.a)
    
    ### calculation of the parameters of the sample
    
    m <- mean(x)
    s <- sd(x)
    med <- median(x)
    ci <- t.test(x)

    ### collect the data in a dataframe

    data <- rbind(data, c(m, s, med, ci$conf.int[1], ci$conf.int[2]))
    names(data) <- c("mean", "standard deviation", "median", "lower ci", "upper ci")

    }

### check the result

head(data)

##    mean standard deviation median lower ci upper ci
## 1 24.62              4.910  23.74    23.65    25.59
## 2 24.66              5.263  24.11    23.62    25.71
## 3 24.88              4.914  24.82    23.90    25.85
## 4 24.66              5.250  23.95    23.62    25.71
## 5 25.29              4.961  26.16    24.30    26.27
## 6 25.08              4.977  24.92    24.10    26.07

m <- data[,1]
sd <- data[,2]
med <- data[,3]
lower <- data[,4]
upper <- data[,5]

width.95CI.100 <- mean(upper - lower)



### plot the results
### The plot report the mean in the sample (black points) and the expected population mean (pink vertical line)
### The plot also includes the lower and the upper CI (blue and green, respectively)
### The extremes of the 95%CI for each sample are united by a line, which is red when the CI does not include the population mean


plot(data$mean, row.names(data), xlim=c(min(data[,4]-1), max(data[,5]+1)), xlab="Mean in the sample", ylab="Repetition", col="black", pch=19, cex=2)
abline(v=m.a,col="pink", lwd=4)

points(data[,4], row.names(data), col="blue",pch=19, cex=1.25)
points(data[,5], row.names(data), col="green",pch=19, cex=1.25)


for (i in 1:N) { 
    
    cl <- ifelse(data[,4][i] < m.a & m.a < data[,5][i], 8, 2)

    segments(data[,4][i], i, data[,5][i], i, col=cl, lwd=1.5)

    }

plot of chunk unnamed-chunk-2

By changing the sample size (n.a in the function) you can note the impact of sample size on the confidence interval.

Let ’ s see a sample size = 20.

####################################################################################################################################
###
### Calculation of 95% confidence interval for a mean
###
### We will have 100 simulation for a sample with 20 cases from a population with mean = 25 and standatd deviation = 5
###
####################################################################################################################################


### set seed for reproducibility

set.seed(345)

### creation of an empty dataframe 

data <- data.frame()

####################################################################################################################################
### set the values for the simulation on 100 cases
####################################################################################################################################

N <- 100        ### number of simulations
n.a <- 20       ### sample size in group 
m.a <- 25       ### mean in the population from which is sampled the group 
sd.a <- 5       ### standard deviation in the population from which is sampled the group 



####################################################################################################################################
### function for simulation
####################################################################################################################################

for (i in 1:N) { 

    ### creation of random variable from a population with mean = m.a and sd = sd.a
    x <- rnorm(n.a, m.a, sd.a)
    
    ### calculation of the parameters of the sample
    
    m <- mean(x)
    s <- sd(x)
    med <- median(x)
    ci <- t.test(x)

    ### collect the data in a dataframe

    data <- rbind(data, c(m, s, med, ci$conf.int[1], ci$conf.int[2]))
    names(data) <- c("mean", "standard deviation", "median", "lower ci", "upper ci")

    }

### check the result

head(data)

##    mean standard deviation median lower ci upper ci
## 1 25.59              5.418  24.43    23.06    28.13
## 2 23.85              4.964  22.31    21.53    26.18
## 3 24.10              5.014  23.81    21.75    26.44
## 4 23.91              3.631  23.58    22.21    25.61
## 5 25.65              5.453  26.14    23.10    28.20
## 6 24.71              4.914  24.04    22.41    27.01

m <- data[,1]
sd <- data[,2]
med <- data[,3]
lower <- data[,4]
upper <- data[,5]

width.95CI.20 <- mean(upper - lower)


### plot the results
### The plot report the mean in the sample (black points) and the expected population mean (pink vertical line)
### The plot also includes the lower and the upper CI (blue and green, respectively)
### The extremes of the 95%CI for each sample are united by a line, which is red when the CI does not include the population mean


plot(data$mean, row.names(data), xlim=c(min(data[,4]-1), max(data[,5]+1)), xlab="Mean in the sample", ylab="Repetition", col="black", pch=19, cex=2)
abline(v=m.a,col="pink", lwd=4)

points(data[,4], row.names(data), col="blue",pch=19, cex=1.25)
points(data[,5], row.names(data), col="green",pch=19, cex=1.25)


for (i in 1:N) { 
    
    cl <- ifelse(data[,4][i] < m.a & m.a < data[,5][i], 8, 2)

    segments(data[,4][i], i, data[,5][i], i, col=cl, lwd=1.5)

    }

plot of chunk unnamed-chunk-3

And now, let ’ s see a sample size = 1000.

####################################################################################################################################
###
### Calculation of 95% confidence interval for a mean
###
### We will have 100 simulation for a sample with 1000 cases from a population with mean = 25 and standatd deviation = 5
###
####################################################################################################################################


### set seed for reproducibility

set.seed(345)

### creation of an empty dataframe 

data <- data.frame()

####################################################################################################################################
### set the values for the simulation on 100 cases
####################################################################################################################################

N <- 100        ### number of simulations
n.a <- 1000     ### sample size in group 
m.a <- 25       ### mean in the population from which is sampled the group 
sd.a <- 5       ### standard deviation in the population from which is sampled the group 



####################################################################################################################################
### function for simulation
####################################################################################################################################

for (i in 1:N) { 

    ### creation of random variable from a population with mean = m.a and sd = sd.a
    x <- rnorm(n.a, m.a, sd.a)
    
    ### calculation of the parameters of the sample
    
    m <- mean(x)
    s <- sd(x)
    med <- median(x)
    ci <- t.test(x)

    ### collect the data in a dataframe

    data <- rbind(data, c(m, s, med, ci$conf.int[1], ci$conf.int[2]))
    names(data) <- c("mean", "standard deviation", "median", "lower ci", "upper ci")

    }

### check the result

head(data)

##    mean standard deviation median lower ci upper ci
## 1 25.00              5.089  24.91    24.68    25.32
## 2 25.04              4.814  25.08    24.75    25.34
## 3 25.00              4.786  24.91    24.71    25.30
## 4 25.14              4.815  25.19    24.84    25.44
## 5 25.04              4.912  25.18    24.73    25.34
## 6 24.90              4.888  24.86    24.60    25.20

m <- data[,1]
sd <- data[,2]
med <- data[,3]
lower <- data[,4]
upper <- data[,5]

width.95CI.1000 <- mean(upper - lower)

### plot the results
### The plot report the mean in the sample (black points) and the expected population mean (pink vertical line)
### The plot also includes the lower and the upper CI (blue and green, respectively)
### The extremes of the 95%CI for each sample are united by a line, which is red when the CI does not include the population mean


plot(data$mean, row.names(data), xlim=c(min(data[,4]-1), max(data[,5]+1)), xlab="Mean in the sample", ylab="Repetition", col="black", pch=19, cex=2)
abline(v=m.a,col="pink", lwd=4)

points(data[,4], row.names(data), col="blue",pch=19, cex=1.25)
points(data[,5], row.names(data), col="green",pch=19, cex=1.25)


for (i in 1:N) { 
    
    cl <- ifelse(data[,4][i] < m.a & m.a < data[,5][i], 8, 2)

    segments(data[,4][i], i, data[,5][i], i, col=cl, lwd=1.5)

    }

plot of chunk unnamed-chunk-4

We can see that the width of the confidence internal is related to the sample size.

width.95CI.20

## [1] 4.591

width.95CI.100

## [1] 1.942

width.95CI.1000

## [1] 0.6192

width <- cbind(width.95CI.20, width.95CI.100, width.95CI.1000)

barplot(width, col = "gold", cex.names = 0.75,
        names.arg = c("Width of 95%CI for n = 20", "Width of 95%CI for n = 100", "Width of 95%CI for n = 1000"))

plot of chunk unnamed-chunk-5

The link between the width of the confidence internal and sample size is reproducible. You can repeat the “experiment” several times, by changing the seed to have indpendent samples, and you’ll find that CI from larger sample size have lower width.

I did a replication experiment repeating 10 times the estimation in samples with size 20, 100 and 1000. Here the results.

Instead, the fraction of CI that contains the expected population mean does not vary relevantly by sample size.

### creation of an empty dataframe 

df <- data.frame()

    ### inizio del ciclo a 100

    for (j in 1:100) { 



####################################################################################################################################
### Set the conditions for the simulation
####################################################################################################################################


### set seed for reproducibility

set.seed(123)

### creation of an empty dataframe 

data <- data.frame()

####################################################################################################################################
### set the values for the simulation on 100 cases
####################################################################################################################################

N <- 100        ### number of simulations
n.a <- 5*j*2        ### sample size in group 
m.a <- 25       ### mean in the population from which is sampled the group 
sd.a <- 5       ### standard deviation in the population from which is sampled the group 



####################################################################################################################################
### function for simulation
####################################################################################################################################

for (i in 1:N) { 

    ### creation of random variable and group
    x <- rnorm(n.a, m.a, sd.a)
    
    ### calculation of the parameters of the sample
    
    m <- mean(x)
    s <- sd(x)
    med <- median(x)
    ci <- t.test(x)

    ### collect the data in a dataframe

    data <- rbind(data, c(m, s, med, ci$conf.int[1], ci$conf.int[2]))
    names(data) <- c("mean", "standard deviation", "median", "lower ci", "upper ci")

    }

### check the result

head(data)

m <- data[,1]
sd <- data[,2]
med <- data[,3]
lower <- data[,4]
upper <- data[,5]



### ciclo di confronto delle medie con il lower e l'upper ci della prima stima

dat <- data.frame()

for (i in 1:N) {
lw.i <- ifelse(m.a > lower[i], 1, 0)
    ## print(lw.i)


up.i <- ifelse(m.a  < upper[i], 1, 0)
    ## print(up.i)
    
    ### collect the data in a dataframe

    dat <- rbind(dat, c(lw.i, up.i))
    names(dat) <- c("lower ci passed", "upper ci passed")

        }

### ci includes the mean

ci.mean <- ifelse( dat[,1]+dat[,2] > 1, 1, 0)
ci.mean

### % ci includes the mean

    per.ci <- sum(ci.mean/length(ci.mean))

    ### collect the data in a dataframe

    df <- rbind(df, c(n.a, per.ci))
    names(df) <- c("Sample size", "CI including the mean")

    }

head(df)

##   Sample size CI including the mean
## 1          10                  0.98
## 2          20                  0.98
## 3          30                  0.96
## 4          40                  0.93
## 5          50                  0.98
## 6          60                  0.96

plot(df, pch=19, cex=2, col="blue")

rank <- cor.test(df[,1],df[,2], method="spearman", exact=FALSE)

options(scipen=999)
rho <- rank$estimate[[1]]
p <- rank$p.value
p <- ifelse(p < 0.0001, 0.0001, p)
c <- ifelse(p < 0.001, "<", "=")

mtext(paste0("Spearman rank correlation = ", round(rho,3), ", p ", c, round(p,4)), side = 3, line = -3, outer = TRUE)

plot of chunk unnamed-chunk-6

options(scipen=0)


### Pearson's product-moment correlation

cor.test(df[,1],df[,2])

## 
##  Pearson's product-moment correlation
## 
## data:  df[, 1] and df[, 2]
## t = -6.017, df = 98, p-value = 3.077e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6495 -0.3597
## sample estimates:
##     cor 
## -0.5194

### Summary of 95%CI including the mean

summary(df[,2])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   0.950   0.970   0.966   0.980   1.000

In the example there was a negative correlation, with a lower fraction of CI containing the expected population mean in the samples with higher size.

However, the finding is based on point estimations, which are greatly exposed to sampling error.

Just change the seed (e.g, 352 or 555) and you’ll find completely different results.

There is a trend for a correlation between the number of repetitions of the estimation of the CI and the fraction of CI containing the expected population mean. Overall, the variation is really limited, with fluctuations between 95% and 99%.

Overall, this fraction does converge to 95%, as more likely as greater the number of repeated testing (as expected).

Let ’ s see this with 10 replications of a variable number number of repetitions of simulations, from 20 to 2000.

####################################################################################################################################
###
### Calculation of 95% confidence interval for a mean
###
### We will have 10 replications of a variable number number of repetitions of simulations, from 20 to 2000
### for a sample with 100 cases from a population with mean = 25 and standatd deviation = 5
###
####################################################################################################################################

### Figures arrenged in 2 rows and 5 column

par(mfrow=c(2,5))

### Start a cycle of 10   replications

    for (k in 1:10) { 

### creation of an empty dataframe 

df <- data.frame()

    ### inizio del ciclo a 100

    for (j in 1:100) { 



####################################################################################################################################
### Set the conditions for the simulation
####################################################################################################################################


### set seed for reproducibility

seed<-123*k

set.seed(123*k)

### creation of an empty dataframe 

data <- data.frame()

####################################################################################################################################
### set the values for the simulation 100 cases
####################################################################################################################################

N <- 10*j*2     ### number of simulations
n.a <- 100      ### sample size in group 
m.a <- 25       ### mean in the population from which is sampled the group 
sd.a <- 5       ### standard deviation in the population from which is sampled the group 



####################################################################################################################################
### function for simulation
####################################################################################################################################

for (i in 1:N) { 

    ### creation of random variable and group
    x <- rnorm(n.a, m.a, sd.a)
    
    ### calculation of the parameters of the sample
    
    m <- mean(x)
    s <- sd(x)
    med <- median(x)
    ci <- t.test(x)

    ### collect the data in a dataframe

    data <- rbind(data, c(m, s, med, ci$conf.int[1], ci$conf.int[2]))
    names(data) <- c("mean", "standard deviation", "median", "lower ci", "upper ci")

    }

### check the result

head(data)

m <- data[,1]
sd <- data[,2]
med <- data[,3]
lower <- data[,4]
upper <- data[,5]


### ciclo di confronto delle medie con il lower e l'upper ci della prima stima

dat <- data.frame()

for (i in 1:N) {
lw.i <- ifelse(m.a > lower[i], 1, 0)
    ## print(lw.i)


up.i <- ifelse(m.a  < upper[i], 1, 0)
    ## print(up.i)
    
    ### collect the data in a dataframe

    dat <- rbind(dat, c(lw.i, up.i))
    names(dat) <- c("lower ci passed", "upper ci passed")

        }


### ci includes the mean

ci.mean <- ifelse( dat[,1]+dat[,2] > 1, 1, 0)
ci.mean

### % ci includes the mean

    per.ci <- sum(ci.mean/length(ci.mean))

    ### collect the data in a dataframe

    df <- rbind(df, c(N, per.ci))
    names(df) <- c("Repetitions", "CI including the mean")

    }

head(df)

rank <- cor.test(df[,1],df[,2], method="spearman", exact=FALSE)

options(scipen=999)

rho <- rank$estimate[[1]]
p <- rank$p.value
p <- ifelse(p < 0.0001, 0.0001, p)
c <- ifelse(p < 0.001, "<", "=")

plot(df, pch=19, cex=2, col="blue", main = paste0("Spearman rho = ", round(rho,3), ", p ", c, round(p,4)))

options(scipen=0)


mtext(paste0("Seed = ", seed), side = 3, line = -4, outer = FALSE)


    }

plot of chunk unnamed-chunk-7

par(mfrow=c(1,1))

As you can see, there is an instability of the finding. Sometime, as lower the number of repetitions of the estimation, as higher the chance that the fraction of CI containing the expected population mean is above 95%. However, when the repetions are less than 100, sometime can happen that the fraction of CI containing the expected population mean is less than 95%, but with increasing number of repetitions this fraction increases again to 95%, so reverting the direction of the correlation.

Again, simulations can be tricky: never trust one single simulation or a single set of simulations. Repeat, repeat, repeat!

The take-home message of this exercise is that confidence intervals can be used to estimate a population parameter. The sampling error does influence the chance that the confidence interval does not contain the population mean (which is unknown, outside simulations). The fraction of CI containing the population mean converges towards 95% in repeated testing. The width of the confidence interval is related to the sample size, the sample size being a proxy for the power of a test. Essentially, as greater the power of your estimation, as more precise is the estimation of the population parameter based on the CI.

One single estimation (an estimation based on a single sample) is not informative within the frequentist approach. A CI that has been estimated from a single sample can or cannot contain the population parameter: “The confidence to be placed in the interval is no greater than the confidence we have in the experimental design and statistical test it is based upon”. (Good & Hardin, cit., 2013).

I hope you’ve enjoyed this!

Have a nice day!

Alt text

Twitter: @AntoViral

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

For more details on using R Markdown see http://rmarkdown.rstudio.com.