DAT-301 / HW-3

2023-06-11

Confidence Intervals

When you study a sample drawn from a population and create summary statistics based on that sample, there is always some uncertainty as to whether your sample statistics represent the true population statistics.

A confidence interval is a range of values, computed from the sample, that is expected to contain the true population value a pre-defined percentage of the time (the confidence level) if the sampling were repeated. We will focus on two-tailed intervals, which means you will have both upper and lower bounds for your range.
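The "percentage of the time" idea can be checked with a small simulation. The sketch below is illustrative only: the population mean, standard deviation, sample size, seed and number of repetitions are all assumed values, and it uses the usual z-based interval for a mean, \(\bar{X} \pm z \cdot s/\sqrt{n}\).

```r
set.seed(301)                      # arbitrary seed, only for reproducibility

mu <- 10; sigma <- 2; n <- 50      # assumed population values and sample size
z <- qnorm(0.975)                  # two-tailed 95% critical value (about 1.96)

# Draw many samples and record whether each interval contains the true mean
covers <- replicate(5000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half_width <- z * sd(x) / sqrt(n)
  abs(mean(x) - mu) <= half_width
})

coverage <- mean(covers)           # share of intervals that captured mu
coverage
```

The share of intervals that capture the true mean should land close to 0.95, which is what the 95% label refers to.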

Calculations

To calculate a confidence interval, you will need four pieces of information:

The point estimate you are calculating the interval for
The critical value for the test statistic
The standard deviation of the sample
The sample size

Calculations - Part 2

The point estimate can be the sample mean, the difference between two variable means, or a variety of other estimates. We will focus on the sample mean \(\bar{X}\) in this example.
The critical value for the normal distribution can be found in a z-distribution table. A critical value (z-score) can be read as a number of standard deviations away from the center of the distribution. Here are some commonly used critical values and the probability of falling within that many standard deviations of the mean.

Z-score	Probability
0.98	67%
1.96	95%
2.58	99%
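The table values can be reproduced with base R's normal distribution functions rather than a printed table. This is a minimal sketch; the variable names are only illustrative.

```r
# Probability of falling within +/- z standard deviations of the mean
z <- c(0.98, 1.96, 2.58)
prob <- pnorm(z) - pnorm(-z)
data.frame(Z = z, Prob = round(prob, 3))

# Going the other way: critical value for a chosen two-tailed confidence level
conf_level <- c(0.95, 0.99)
crit <- qnorm(1 - (1 - conf_level) / 2)
round(crit, 2)
```

`pnorm()` turns a z-score into a cumulative probability, and `qnorm()` inverts that to recover the z-score for a given probability.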

Now that we have the Z-scores, we can look at the formula for confidence intervals:

\(CI = \bar{X}\pm Z*\frac{\sigma}{\sqrt{n}}\)
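As a worked illustration of the formula, the sketch below wraps the four ingredients in a small function. `mean_ci` is a hypothetical helper name, not something from the original material, and the data are simulated stand-ins.

```r
# Hypothetical helper: z-based confidence interval for a sample mean
mean_ci <- function(x, conf_level = 0.95) {
  x_bar <- mean(x)                        # point estimate
  z <- qnorm(1 - (1 - conf_level) / 2)    # critical value (about 1.96 for 95%)
  s <- sd(x)                              # sample standard deviation
  n <- length(x)                          # sample size
  half_width <- z * s / sqrt(n)
  c(lower = x_bar - half_width, estimate = x_bar, upper = x_bar + half_width)
}

set.seed(301)                             # arbitrary seed
x <- rnorm(100, mean = 50, sd = 10)       # simulated stand-in data
ci95 <- mean_ci(x)
ci99 <- mean_ci(x, conf_level = 0.99)
ci95
ci99
```

The 99% interval comes out wider than the 95% interval for the same data, because the larger critical value multiplies the same standard error.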

Calculations - Part 3

The standard deviation is the square root of the variance, where the sample variance is the sum of the squared differences from the mean divided by \(n-1\), or:

\(var = \sigma^2 = \frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}\)

Lastly, the sample size is simply the number of observations in your sample.
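The variance formula can be checked by hand against R's built-in `var()` and `sd()`, which also divide by \(n-1\). The data vector here is made up purely for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)            # made-up observations

n <- length(x)                            # sample size
x_bar <- mean(x)                          # sample mean
variance <- sum((x - x_bar)^2) / (n - 1)  # sum of squared differences over n - 1
std_dev <- sqrt(variance)                 # standard deviation

# Both comparisons should report TRUE
all.equal(variance, var(x))
all.equal(std_dev, sd(x))
```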

Plot of 95% Confidence Interval

Let’s plot a standard normal distribution (mean=0, standard deviation=1) and take a look at a 95% confidence interval, which corresponds to z-scores of approximately \(\pm1.96\).

library(ggplot2)

# Standard normal density over [-4, 4]
z <- seq(-4, 4, 0.01)
f_z <- dnorm(z)

# Central 95% region: bounds at the 2.5% and 97.5% quantiles (about +/-1.96)
q <- qnorm(0.975)
x <- seq(-q, q, 0.01)
shade <- data.frame(x = c(-q, x, q), y = c(0, dnorm(x), 0))

ggplot() + geom_line(aes(z, f_z)) +
  geom_polygon(data = shade, aes(x, y), fill = "blue") +
  theme(plot.margin = margin(0, 1, 3, 1, "cm")) + ylab("")

Plot of 99% Confidence Interval

Here, we will look at the 99% confidence interval, highlighting the part of the distribution that falls outside the interval. This corresponds to approximately \(\pm2.58\) standard deviations from the mean.
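One possible way to draw this plot is sketched below, following the same polygon approach as the 95% plot. The fill colour, the grid resolution and the object names (`tails`, `p99`) are assumptions rather than the original slide code.

```r
library(ggplot2)

# Standard normal density over [-4, 4]
z <- seq(-4, 4, 0.01)
f_z <- dnorm(z)

# Bounds of the central 99% region (about +/-2.58)
q <- qnorm(0.995)

# Polygons for the two tails that fall outside the interval
left  <- seq(-4, -q, length.out = 200)
right <- seq(q, 4, length.out = 200)
tails <- rbind(
  data.frame(x = c(-4, left, -q), y = c(0, dnorm(left), 0), side = "lower"),
  data.frame(x = c(q, right, 4),  y = c(0, dnorm(right), 0), side = "upper")
)

p99 <- ggplot() + geom_line(aes(z, f_z)) +
  geom_polygon(data = tails, aes(x, y, group = side), fill = "red") +
  ylab("")
p99
```

The two shaded tails together hold about 1% of the probability, 0.5% on each side.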

Interactive: Actual vs Theory

Now, we will look at a standard normal vs a randomized sample. We will create a sample of 1000 observations with an expected mean of 0 and standard deviation of 1. We will then plot the random sample via a histogram and compare the results to the standard normal we just described in the previous slides. We can use an interactive graph to compare our actual sample vs the theoretical distribution.
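A minimal sketch of this comparison is shown below. The seed, bin count and colours are assumptions. The plot itself is a static ggplot2 object, and wrapping it with `plotly::ggplotly()` is only one possible way to make it interactive, assuming the plotly package is installed.

```r
library(ggplot2)

set.seed(301)                                   # arbitrary seed
samp <- data.frame(x = rnorm(1000, mean = 0, sd = 1))

p <- ggplot(samp, aes(x)) +
  # Histogram of the random sample, scaled to density so it is comparable
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "grey70", colour = "white") +
  # Theoretical standard normal density drawn on top
  stat_function(fun = dnorm, colour = "blue") +
  ylab("")
p

# Optional, assuming plotly is installed: plotly::ggplotly(p)
```

With 1000 observations the histogram should track the theoretical curve fairly closely, while still showing visible sampling noise from bin to bin.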