Confidence Intervals

2023-09-11

Definition

A confidence interval is a range of values, calculated from a sample, that is built to capture the true mean with a chosen level of confidence.

For example, a 95% confidence interval comes from a procedure that captures the true mean 95% of the time. In other words, if you took 100 samples and calculated a confidence interval for each, about 95 of them would have the true mean within their bounds.

Calculating the true mean

Take a vector ‘V’ whose values are the integers from 1 to 100 (1:100 in R). Its mean can be found easily by adding the first and last numbers and dividing by two, then doing the same for each successive pair:

\(1 + 100 = 101, 2 + 99 = 101,\dots, 50 + 51 = 101\)

\({101 \over 2} = 50.5, {101\over2} = 50.5,\dots\)

Since each result is identical, this is equivalent to simply finding the mean of fifty values that are each \(50.5\); therefore, the mean is simply \(50.5\).
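The pairing argument can also be checked directly in R. This is only a sketch; the name `pair_sums` is introduced here for illustration and does not appear in the original slides.

```r
# The population used throughout: the integers 1 to 100
V <- 1:100

# Pair the smallest value with the largest, the second smallest with
# the second largest, and so on (50 pairs in total)
pair_sums <- V[1:50] + V[100:51]

# Every pair sums to 101, so every pair averages 50.5
all(pair_sums == 101)

# The shortcut and the direct computation agree
(1 + 100) / 2
mean(V)
```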

Obviously, most means that researchers are interested in aren’t as easily calculable as this example. Even so, this relatively simple data set will serve to show just how tricky trying to calculate the ‘true’ mean from random samples can become.

Taking a random sample

Imagine that 4 random samples of n = 20 are taken from this vector. The distribution could look something like this:
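Samples like these might be drawn as sketched here. The seed and the choice to sample without replacement (the default for `sample()`) are assumptions, since the original does not state either, so the values will not match the ones quoted on these slides.

```r
set.seed(1)  # arbitrary seed, assumed only so the sketch is reproducible

V <- 1:100

# Draw 4 samples of n = 20 each; every column of the matrix is one sample
samples <- replicate(4, sample(V, size = 20))

dim(samples)    # 20 rows (observations) by 4 columns (samples)
head(samples)   # first few observations of each sample
```

Calling `hist()` on a single column, for example `hist(samples[, 1])`, gives a quick look at how one sample is distributed.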

Sample means

The means of these samples (the sample means) are more than likely not equal to the true mean of our original vector. In fact, plotting the sample means gives us this:
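The sample means can be computed in one step. As before, the seed and the sampling scheme are assumptions made for illustration, so the numbers differ from those behind the original plot.

```r
set.seed(1)  # arbitrary seed (assumption)

V <- 1:100
samples <- replicate(4, sample(V, size = 20))  # one sample per column

# One mean per sample
sample_means <- colMeans(samples)
sample_means

# How far each sample mean falls from the true mean of 50.5
sample_means - mean(V)
```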

Standard Deviation

The differing sample means in the previous slide reveal the problem inherent in random samples: there’s no guarantee that the random sample chosen will produce the true mean of the population.

With this in mind, one can calculate a confidence interval that is likely to contain the true mean. To do this, we will need the standard deviation, a value that describes (roughly) the typical distance of values from their mean. Mathematically, the sample standard deviation is:

\(s = \sqrt{{\sum^n_{i = 1} (x_i - \bar{x})^2} \over {n - 1}}\), where \(\bar{x}\) = sample mean, n = sample size and \({x_i}\) is the ith value in the sample.
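Read as the square root of the summed squared deviations from the mean divided by n - 1 (the usual sample standard deviation), the definition can be checked by hand against R's `sd()`. The data in `x` are hypothetical and chosen only for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data, not from the slides

n     <- length(x)
x_bar <- mean(x)

# Sum the squared deviations, divide by n - 1, then take the square root
manual_sd <- sqrt(sum((x - x_bar)^2) / (n - 1))

manual_sd
sd(x)                        # built-in equivalent
all.equal(manual_sd, sd(x))  # TRUE
```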

Standard Deviation (cont)

Luckily for us, R has a built-in function, sd(), that computes the standard deviation. We can use it to show that the standard deviation of the vector 1:3 is 1.

# First, we introduce a vector and calculate its mean:
new_v <- 1:3
mean(new_v)

## [1] 2

# Now we find the standard deviation:
sd(new_v)

## [1] 1
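As a check on that result, the calculation can be written out by hand for the values 1, 2 and 3, whose mean is 2:

\(s = \sqrt{{(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2} \over {3 - 1}} = \sqrt{2 \over 2} = 1\)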

Differing Standard Deviation

Just like the means, the standard deviation will also vary slightly by sample. For example:
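A short sketch of that variation, again with an assumed seed and default sampling without replacement, so the values are illustrative rather than those from the original slides:

```r
set.seed(1)  # arbitrary seed (assumption)

V <- 1:100
samples <- replicate(4, sample(V, size = 20))  # one sample per column

# Apply sd() to each column (MARGIN = 2) to get one value per sample
sample_sds <- apply(samples, 2, sd)
sample_sds
```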

Building the Confidence Interval

The Z-score is the number of standard deviations that a data point lies away from the mean; the values that correspond to a given confidence level are listed in a Z-distribution (standard normal) table. For samples of fewer than n = 30 (like ours), a t-distribution is used instead.

With this in mind, the general equation for constructing a confidence interval is:

CI = \(\bar{x} \pm z \cdot {sd\over \sqrt{n}}\)
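The formula translates almost directly into R. The helper `confidence_interval()` is hypothetical, written here only to illustrate the arithmetic; its `critical_value` argument stands for the z-score (or the t-score, for small samples).

```r
# Hypothetical helper, not part of the original slides
confidence_interval <- function(x_bar, s, n, critical_value) {
  # Margin of error: critical value times the standard error of the mean
  margin <- critical_value * s / sqrt(n)
  c(lower = x_bar - margin, upper = x_bar + margin)
}

# Made-up inputs chosen so the arithmetic is easy to follow:
# margin = 2 * 2 / sqrt(16) = 1, so the interval runs from 9 to 11
example_ci <- confidence_interval(x_bar = 10, s = 2, n = 16, critical_value = 2)
example_ci
```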

The z-score you choose depends on the level of confidence you wish to have. For a 95% confidence interval, you would choose the z-score for the 97.5th percentile, roughly 1.96 (because the interval is two-sided, you need 2.5% in the left tail and 2.5% in the right; the 2.5th percentile is the same value with a negative sign).
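Rather than reading a table, the z-score can be looked up with `qnorm()`, the quantile function of the standard normal distribution. The variable names are illustrative.

```r
conf_level <- 0.95
alpha      <- 1 - conf_level

# Upper cut-off: the 97.5th percentile of the standard normal distribution
z <- qnorm(1 - alpha / 2)
z                  # approximately 1.96

# Lower cut-off: the 2.5th percentile, the same value with the sign flipped
qnorm(alpha / 2)
```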

The Final Product

The t-score for 95% confidence is the two-tailed critical value for \(\alpha\) = .05 (confidence = \(1-\alpha\)) at 19 degrees of freedom (n - 1). According to the t-distribution table, this value is 2.093. Used in place of z, and combined with the data from our first sample, it produces the equation:
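The table lookup can be reproduced with `qt()`, the quantile function of the t-distribution. A minimal sketch, with variable names chosen for illustration:

```r
n     <- 20
alpha <- 0.05

# Two-tailed critical value: the 97.5th percentile with n - 1 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 1)
round(t_crit, 3)   # 2.093

# The t value exceeds the corresponding z value, so the interval is wider
t_crit > qnorm(1 - alpha / 2)
```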

CI = \(51.6 \pm 2.093 \cdot {29.3480027\over \sqrt{20}} \approx 51.6 \pm 13.74\)

As you can see, the true mean \(50.5\) does indeed lie between \(37.86\) and \(65.34\).
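The arithmetic can be reproduced from the summary statistics quoted in the equation (sample mean 51.6, standard deviation 29.3480027, n = 20). This sketch uses the table value 2.093 for the t-score; the variable names are illustrative.

```r
x_bar  <- 51.6         # sample mean quoted above
s      <- 29.3480027   # sample standard deviation quoted above
n      <- 20
t_crit <- 2.093        # table value for 19 degrees of freedom

margin <- t_crit * s / sqrt(n)
lower  <- x_bar - margin
upper  <- x_bar + margin

round(c(margin = margin, lower = lower, upper = upper), 2)
# margin 13.74, lower 37.86, upper 65.34

# Does the interval contain the true mean?
lower <= 50.5 && 50.5 <= upper
```

Given the raw sample in a vector `x`, `t.test(x)$conf.int` returns the same kind of interval without the manual steps.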

Across repeated samples, intervals built this way will contain the true mean about 95% of the time.
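A small simulation gives a feel for that 95% figure. It rests on several assumptions not found in the original: an arbitrary seed, 5,000 repetitions, and sampling with replacement so that the draws are independent, as the formula assumes. Sampling without replacement from a population this small tends to produce intervals that cover the true mean somewhat more often than 95%.

```r
set.seed(1)  # arbitrary seed (assumption)

V         <- 1:100
true_mean <- mean(V)
n         <- 20
n_reps    <- 5000                  # number of simulated samples (assumption)
t_crit    <- qt(0.975, df = n - 1)

# For each repetition: draw a sample, build the interval,
# and record whether it contains the true mean
covers <- replicate(n_reps, {
  x      <- sample(V, size = n, replace = TRUE)
  margin <- t_crit * sd(x) / sqrt(n)
  (mean(x) - margin <= true_mean) && (true_mean <= mean(x) + margin)
})

# Proportion of intervals that captured the true mean
mean(covers)
```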