Objectives:

Motivation:

In the 4.1 notes, we discussed the concept of sampling distributions as well as the Central Limit Theorem to allow us to describe how a sample of observations from a particular distribution will behave. Recall that the distributions we discussed in the 4.1 notes came from what we considered to be “known” distributions - distributions for which we know the specifics for how to compute the mean and variance/standard deviation of the distribution.

Examples

But what happens if we run into a scenario where we have a random variable that doesn’t follow one of these “known” distributions? In this set of notes, we’ll discuss several different approaches we can take when we must sample from an unknown distribution.

Sampling from Populations

Variation in Sampling

One goal of sampling is to use the sample we collect to make an inference or claim about the larger population from which the sample was taken. That is, based on a smaller (and hopefully representative!) subset of the population, what can be said about the population itself?

Example

Suppose we are interested in the average height of all active Major League Baseball (MLB) players. If we take a sample of \(n=30\) players and measure their heights, we can use the average height of those 30 players to make a claim about the average height of all active players.

In the example above, we are using a sample mean to make a claim about a population mean. One thing to note about using samples to estimate what’s going on in the population is that different samples may produce different results. That is, samples will vary even when the samples are collected in the same manner.

Example

Etc.

This variation or variance from sample-to-sample is really the focus of statistics!
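To see this sample-to-sample variation concretely, here is a small simulation sketch in base R. The population of heights is invented for illustration: the mean of 187 cm, the standard deviation of 6 cm, the population size of 1,000, and the names `population` and `sample_means` are all assumptions, not real MLB data.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 1,000 player heights in cm (made-up values)
population <- rnorm(1000, mean = 187, sd = 6)

# Draw five separate samples of n = 30 in the same way and record each sample mean
sample_means <- replicate(5, mean(sample(population, size = 30)))

round(sample_means, 2)  # five different estimates of the same population mean
mean(population)        # the fixed value they are all trying to estimate
```

Every sample is collected in exactly the same manner, yet the five sample means differ from one another and from the population mean.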

Parameters and Statistics - A Review

As was first discussed in the 3.1 notes, one of the best ways of “summarizing” the variation that comes with collecting data is to aggregate all of the information collected from each object in the target population into a single numerical property. Recall from the 3.1 notes that a population is the set of all instances (units, people, objects, events, regions, etc.) we are interested in when wanting to study the behavior of a variable. We usually denote the size of a population (if known) with \(N\). At the population level these numerical properties are called parameters.

Ideally, we would record information on a variable of interest for every observation in a given population and thus would be able to compute appropriate population parameters. However, in a lot of cases, it is impractical (or too expensive, or even impossible) to do so, since most populations are very large. Instead we most often select a subset of observations from a population of interest and record information on our variable of interest for every observation in that subset.

Recall from the 3.1 notes that a sample is a selected subset of observations from the population, usually much smaller than the population itself. We usually denote the size of a sample with \(n\).

By obtaining a measurement from each object in a sample of the population, we can then make inferences about the population as a whole. Just as we can think about aggregating all of the information collected from each object in a population into a single numerical property, we can do so in a sample as well. At the sample level, these summary values are called statistics.

Example 4.2.1

The following data were produced by a simple random sample of houses that have sold in my area in the past six months. The variable of interest is the selling price of a home, in $1000s.

575.0, 549.0, 572.5, 649.9, 485.0

Let’s consider a few sample statistics based on this sample of \(n = 5\) houses.

The sample mean is calculated as \(\overline{x}=\frac{\sum_{i=1}^{n}x_i}{n}\)

The sample standard deviation is calculated as \(s=\sqrt{\frac{\sum_{i=1}^{n}(x_i-\overline{x})^2}{n-1}}\)

The \(50^{\text{th}}\) percentile (also called the median) is the number such that 50% of the data is less than that number.

The \(25^{\text{th}}\) percentile (also called the \(1^{\text{st}}\) quartile or \(Q1\)) is the number such that 25% of the data is less than that number.

The \(75^{\text{th}}\) percentile (also called the \(3^{\text{rd}}\) quartile or \(Q3\)) is the number such that 75% of the data is less than that number.

library(mosaic)
x = c(575.0, 549.0, 572.5, 649.9, 485.0)
favstats(x)
##  min  Q1 median  Q3   max   mean       sd n missing
##  485 549  572.5 575 649.9 566.28 59.18629 5       0
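As a check on the formulas above, the same mean and standard deviation can be computed step by step in base R. This is only a sketch; the names `xbar` and `s` are chosen here for illustration.

```r
x <- c(575.0, 549.0, 572.5, 649.9, 485.0)
n <- length(x)

# Sample mean: sum of the observations divided by n
xbar <- sum(x) / n

# Sample standard deviation: sum the squared deviations from xbar,
# divide by n - 1, then take the square root
s <- sqrt(sum((x - xbar)^2) / (n - 1))

c(mean = xbar, sd = s)

# Both should agree with the built-in functions and with the favstats() output above
all.equal(xbar, mean(x))
all.equal(s, sd(x))
```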

Now suppose a sixth sold house is randomly selected, with a selling price of $975,000. Compute the mean and median of this new sample:

485.0, 549.0, 572.5, 575.0, 649.9, 975.0

x=c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
favstats(x)
##  min      Q1 median      Q3 max  mean       sd n missing
##  485 554.875 573.75 631.175 975 634.4 175.0555 6       0

Result: The sample median is said to be robust, or less sensitive to irregular/extreme data points, compared to the sample mean.
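The size of this effect can be measured directly. The sketch below (variable names such as `shift_mean` are illustrative) compares how far each measure of center moves when the $975,000 house is added to the original five.

```r
x5 <- c(575.0, 549.0, 572.5, 649.9, 485.0)
x6 <- c(x5, 975.0)   # same sample plus the $975,000 house

# Change in each measure of center caused by the one extreme value (in $1000s)
shift_mean   <- mean(x6) - mean(x5)
shift_median <- median(x6) - median(x5)

c(mean = shift_mean, median = shift_median)
```

Using the values already reported by `favstats()`, the mean rises from 566.28 to 634.4 (a shift of 68.12, or about $68,000), while the median rises only from 572.5 to 573.75 (a shift of 1.25, or $1,250).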

Irregular data points can (and often do) occur when the variable of interest possesses (in the population) a skewed distribution (either right-skewed or left-skewed). In such cases, the sample median should be used as the appropriate measure of center.

One must also be careful in using the sample standard deviation as a measure of spread! Use the value of \(s\) to estimate the spread in the population variable of interest only when the sample mean \(\overline{x}\) is being used as the measure of center. This is because the sample standard deviation (like the sample mean) is very sensitive to irregular/extreme values. The “skewing effect” of the sample mean is compounded in the calculation of \(s\).

So what is one way we can deal with skewed distributions in the context of trying to estimate parameters? One option to consider is bootstrapping.

Bootstrapping

In statistics, bootstrapping is any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.

The general process of bootstrapping is as follows:

  1. An initial sample of \(n\) observations is taken from a distribution.
  2. From that initial sample, we sample with replacement \(n\) times to obtain a bootstrap sample. We do this repeatedly to obtain a large number of bootstrap samples.
  3. For each bootstrap sample, the sample statistic of interest is calculated (\(\overline{x}\) or \(\widehat{p}\)).
  4. These bootstrap sample statistics form the sampling distribution of the sample statistic.
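The steps above can be sketched in base R without any extra packages. This is an illustration rather than the approach used elsewhere in these notes (which rely on `do()` and `resample()` from mosaic); the choice of 1,000 bootstrap samples and the name `boot_means` are assumptions made for the example.

```r
set.seed(1)

# Step 1: an initial sample of n observations (selling prices in $1000s)
x <- c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
n <- length(x)

# Steps 2 and 3: resample n values with replacement, compute the statistic, repeat
boot_means <- replicate(1000, mean(sample(x, size = n, replace = TRUE)))

# Step 4: the collection of bootstrap statistics approximates the sampling distribution
summary(boot_means)
sd(boot_means)
```

Because the resampling is random, the exact numbers change from run to run unless a seed is set.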

As we will see later on in the notes, we can use this bootstrap-based sampling distribution to form confidence interval estimates of parameters.

Bootstrapping allows for a better estimation of the sampling distribution of a sample statistic when one cannot assume that the variable follows a normal distribution (or any of the other well-known distributions, such as Poisson, exponential, etc.).

To better demonstrate this, let’s go back to the modified Example 4.2.1.

Example 4.2.1 (revisited)

The following data were produced by a simple random sample of six houses that have sold in my area in the past six months. The variable of interest is the selling price of a home, in $1000s.

575.0, 549.0, 572.5, 649.9, 485.0, 975.0

Let’s use R to compare the distribution of the sample itself, a distribution based on the normal distribution, and a bootstrap distribution.

First, the distribution of the sample:

x=c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
mean(x)
## [1] 634.4
sd(x)
## [1] 175.0555
favstats(x)
##  min      Q1 median      Q3 max  mean       sd n missing
##  485 554.875 573.75 631.175 975 634.4 175.0555 6       0

Now, a distribution based on the normal distribution:

y=rnorm(n = 1000, mean = 634.4, sd = 175.0555)
favstats(y)
##       min       Q1   median       Q3      max     mean       sd    n missing
##  118.7369 521.6396 648.6604 757.2487 1306.529 637.3441 178.8723 1000       0

Finally, a bootstrap distribution:

library(mosaic)
x = c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
B = do(1000)*mean(resample(x, 6))
favstats(B$mean)
##  min       Q1   median       Q3      max    mean       sd    n missing
##  500 580.2167 632.5833 680.0875 920.8167 635.286 64.70401 1000       0

Compare the means and standard deviations of these three “distributions.”

What do we see? While the means of these three distributions are approximately the same, notice that there is far less variability in the bootstrap distribution (a standard deviation of about 65) than in the “normal” distribution (about 179).

Why does it matter? Bootstrap-based distributions can be considered as more robust against extreme values compared to traditional methods of estimating parameters. We will see the use of this in the next section.
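One point to keep in mind when reading this comparison: the bootstrap values are means of six resampled prices, while the `rnorm()` values stand in for individual prices, so part of the difference in spread comes from averaging. A rough check, sketched below with the illustrative name `se_mean`, is to compare the bootstrap standard deviation with the estimated standard error \(\frac{s}{\sqrt{n}}\).

```r
x <- c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
n <- length(x)

# Spread of the individual prices in the sample
sd(x)

# Estimated standard error of the sample mean, s / sqrt(n)
se_mean <- sd(x) / sqrt(n)
se_mean
```

The standard error works out to roughly 71.5, which is much closer to the bootstrap standard deviation reported above (about 64.7) than to the standard deviation of the individual prices (about 175).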

Estimating Parameters with Confidence Intervals

As discussed above, parameters represent numerical properties that act as summary values used to describe the behavior of a given variable in the population of interest. Statistics are these summary values calculated at the sample level and are typically used to estimate the corresponding (unknown) population parameters.

Examples

These sample statistics on their own are what we call point estimates, or single-value estimates of their corresponding population parameters. Since these point estimates vary from sample to sample, using them on their own to estimate the corresponding population parameter may not be extremely useful, as this variation of the sample statistics is not taken into account.

Another option for estimation involves the computation of confidence intervals. A confidence interval is constructed based on sample data and a point estimate and is an interval estimate of the target population parameter. It gives us a range of values that are plausible for the population parameter.

In this class, we will discuss how to construct confidence intervals for two parameters of interest: \(p\) and \(\mu\).

Estimating \(p\) with Confidence Intervals

Selecting the Margin of Error

The margin of error in a confidence interval is the amount that we will both add and subtract from the point estimate of interest in order to produce a desired confidence level for a confidence interval. We can change the confidence level by changing the margin of error.

The greater the margin of error, the higher our confidence level will be. We represent the allowable error that our interval will contain with \(\alpha\). For example, if we construct an interval for which we could be 100% sure that the interval would always range across all possible values of the parameter of interest, then the allowable error is \(\alpha=0\). Another way to state this is to say that we have made a \((1-\alpha)*100\)% confidence interval or, in this case, a \((1-0)*100\)% or a 100% confidence interval.

The structure of a confidence interval for \(p\) is:

\[\widehat{p}\pm\text{ margin of error}\] \[\widehat{p}\pm Z_{\frac{\alpha}{2}}\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\text{ where }\widehat{p}=\frac{X}{n}\]
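To make the link between confidence level and margin of error concrete, the sketch below tabulates the critical value \(Z_{\frac{\alpha}{2}}\) and the resulting margin of error for three common confidence levels. The values \(\widehat{p}=0.5\) and \(n=1000\) are hypothetical, chosen only for illustration.

```r
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels

# Critical value Z_{alpha/2}: the standard normal quantile leaving alpha/2 in the upper tail
z_crit <- qnorm(1 - alpha / 2)

# Margin of error for a hypothetical p-hat of 0.5 and sample size of 1000
p_hat <- 0.5
n <- 1000
margin <- z_crit * sqrt(p_hat * (1 - p_hat) / n)

data.frame(conf_levels, alpha, z_crit = round(z_crit, 3), margin = round(margin, 4))
```

The critical values are roughly 1.645, 1.960, and 2.576, so with the same data a higher confidence level always comes with a wider interval.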

Example 4.2.2

An Ipsos-Reid poll of \(n = 1,034\) randomly selected Canadian voters was taken between February \(14^{th}\) and February \(18^{th}\). Each voter was asked the following question: “If a federal election were to be held tomorrow, what political party would you vote for?” Of those polled, \(382\) said they would vote for the Liberal Party should a federal election occur tomorrow.

  1. Compute the sample proportion as well as the margin of error.
382/1034
## [1] 0.3694391
qnorm(.975)*(382/1034*(1-382/1034)/1034)^(.5)
## [1] 0.02941865
  2. Use the information computed above to find a 95% confidence interval estimate for the percentage of all Canadian voters who would vote for the Liberal Party.
382/1034 - qnorm(.975)*(382/1034*(1-382/1034)/1034)^(.5)
## [1] 0.3400204
382/1034 + qnorm(.975)*(382/1034*(1-382/1034)/1034)^(.5)
## [1] 0.3988577
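The calculation in Example 4.2.2 can be wrapped in a small helper so that the same steps are reusable for other polls. The function name `prop_z_interval` and its arguments are hypothetical, introduced only for this sketch; it implements the z-based formula for \(p\) given above and nothing more.

```r
# Hypothetical helper: z-based confidence interval for a proportion
# x = number of "successes", n = sample size, conf = confidence level
prop_z_interval <- function(x, n, conf = 0.95) {
  p_hat  <- x / n
  alpha  <- 1 - conf
  z_crit <- qnorm(1 - alpha / 2)
  margin <- z_crit * sqrt(p_hat * (1 - p_hat) / n)
  c(estimate = p_hat, margin = margin,
    lower = p_hat - margin, upper = p_hat + margin)
}

# Example 4.2.2: 382 of 1,034 voters
prop_z_interval(382, 1034)
```

Changing the `conf` argument (for example, `conf = 0.99`) reuses the same steps with a different critical value.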

Confidence Intervals Based on Bootstrap Percentiles

If we were only concerned with 95% confidence intervals and always had a symmetric, bell-shaped bootstrap distribution, the confidence interval as it is computed in the above section would probably be all that we need. But we may end up with a bootstrap distribution that is symmetric but subtly flatter (or steeper) so that more (or less) than 95% of bootstrap statistics are within \(Z_{\frac{\alpha}{2}}\) standard errors of the center.

Fortunately, we can use the percentiles of the bootstrap distribution to locate the actual middle \((1-\alpha)*100\)% of the distribution. Specifically, if we want the middle \((1-\alpha)*100\)% of the bootstrap distribution (the values that are most likely to be close to the center), we can just chop off the lowest \(\frac{\alpha}{2}*100\)% and highest \(\frac{\alpha}{2}*100\)% of the bootstrap statistics to produce an interval.
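A minimal base-R sketch of this "chop off the tails" idea follows. The helper name `percentile_interval`, the seed, and the choice of 2,000 bootstrap samples are assumptions made for illustration; the 0/1 data vector encodes the poll from Example 4.2.2.

```r
set.seed(2024)

# Hypothetical helper: keep the middle (1 - alpha) * 100% of the bootstrap statistics
percentile_interval <- function(boot_stats, conf = 0.95) {
  alpha <- 1 - conf
  quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2))
}

# Sample recorded as 1 (would vote Liberal) or 0 (otherwise)
votes <- c(rep(1, 382), rep(0, 1034 - 382))

# Bootstrap distribution of the sample proportion
# (sample() with replace = TRUE draws length(votes) values by default)
boot_props <- replicate(2000, mean(sample(votes, replace = TRUE)))

percentile_interval(boot_props)
```

The endpoints are simply the 2.5th and 97.5th percentiles of the bootstrap proportions, so no assumption about the shape of the bootstrap distribution is needed.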

Visually:

Example 4.2.2 (revisited)

An Ipsos-Reid poll of \(n = 1,034\) randomly selected Canadian voters was taken between February \(14^{th}\) and February \(18^{th}\). Each voter was asked the following question: “If a federal election were to be held tomorrow, what political party would you vote for?” Of those polled, \(382\) said they would vote for the Liberal Party should a federal election occur tomorrow. Compute a bootstrap 95% confidence interval estimate for the percentage of all Canadian voters who would vote for the Liberal Party.

library(mosaic)

B = do(1000)*mean(resample(c(rep(1, 382), rep(0, 1034-382)), 1034));

quantile(B$mean, 0.025);
##      2.5% 
## 0.3413685
quantile(B$mean, 0.975);
##     97.5% 
## 0.3984526

Interpreting Confidence Intervals

A confidence interval for a population proportion gives a set of values that are plausible for that proportion. If a value is not in the confidence interval, we conclude that it is an implausible/unlikely value for the actual population proportion. It’s not impossible that the population value is outside the interval, but it would be pretty surprising.

For example, suppose a candidate for political office conducts a poll and finds that a 95% confidence interval for the proportion of voters who will vote for him is 42% to 48%. He would be wise to conclude that he does not have 50% of the population voting for him. The reason is that the value 50% is not in the confidence interval, so it is implausible to believe that the population value is 50%. Sometimes drawing a picture helps!

There are many common misinterpretations of confidence intervals that you must avoid. The most common mistake that is made is trying to turn confidence intervals into some sort of probability problem. For example, if asked to interpret a 95% confidence interval of 45.9% to 53.1%, many people would mistakenly say, “This means there is a 95% chance that the population percentage is between 45.9% and 53.1%.”

What’s wrong with this statement? Remember that probabilities are long-run frequencies. The above interpretation claims that if we were to repeat this survey many times, then in 95% of the surveys the true population percentage would be a number between 45.9% and 53.1%. This claim is wrong! This is because the true population percentage doesn’t change. It is either always between 45.9% and 53.1% or it is never between these two values. It can’t be between these two numbers 95% of the time and somewhere else the rest of the time.

Another analogy will help make this clear. Suppose there is a skateboard factory where 95% of the skateboards produced are perfect, but the other 5% have no wheels. Once you buy a skateboard from this factory, you can’t say that there is a 95% chance that it has wheels. Either it has wheels or it does not have wheels. It is not true that the board has wheels 95% of the time and, mysteriously, no wheels the other 5% of the time. A confidence interval is like one of these skateboards. Either it contains the true parameter (has wheels) or it does not. The “95% confidence” refers to the “factory” that “manufactures” confidence intervals: 95% of its products are good, 5% are bad. Our confidence is in the process, not in the product.
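The "factory" idea can be checked by simulation. The sketch below assumes a true population proportion of 0.50 and surveys of 1,000 people (both values are invented for the illustration), builds a 95% interval from each simulated survey, and counts how often the interval contains the true value.

```r
set.seed(123)

true_p <- 0.50   # assumed true population proportion (unknown in practice)
n      <- 1000   # assumed survey size
reps   <- 2000   # number of simulated surveys

covers <- replicate(reps, {
  p_hat  <- rbinom(1, size = n, prob = true_p) / n
  margin <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
  (p_hat - margin <= true_p) && (true_p <= p_hat + margin)
})

mean(covers)  # share of intervals that captured true_p; expect a value near 0.95
```

Each individual interval either contains 0.50 or it does not; the 95% describes the long-run share of intervals that do.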

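A correct interpretation: We are \((1-\alpha)*100\)% confident that the true population proportion is between the lower and the upper limit calculated, meaning that the method used to build the interval captures the true proportion in \((1-\alpha)*100\)% of samples.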

Example 4.2.3

A random sample of \(n=3,005\) Canadians between the ages of 30 and 65 revealed that \(1,683\) expect to work past the traditional retirement age of 65.

  1. Find a 99% confidence interval for \(p\), the proportion of Canadians aged 30 to 65 who expect to be working past the age of 65.
1683/3005 - qnorm(.995)*(1683/3005*(1-1683/3005)/3005)^(.5)
## [1] 0.5367423
1683/3005 + qnorm(.995)*(1683/3005*(1-1683/3005)/3005)^(.5)
## [1] 0.5833908
  2. Interpret the meaning of this interval in the context of the data.
  3. Can you infer from the interval above that (a) \(p = 0.54\)? (b) \(p < 0.60\)?

Example 4.2.4

A random sample of \(n = 109\) first-year University of Calgary students revealed that 23 had used marijuana in the past six months.

  1. Find a 95% confidence interval for \(p\), the proportion of all first-year University of Calgary students that have used marijuana in the past six months, based on the distribution of \(\hat{p}\) and based on bootstrapping.
p = 23/109
n = 109
conf = 0.95
p - qnorm(conf+(1-conf)/2)*(p*(1-p)/n)^(.5)
## [1] 0.1344105
p + qnorm(conf+(1-conf)/2)*(p*(1-p)/n)^(.5)
## [1] 0.2876079
library(mosaic)

B = do(1000)*mean(resample(c(rep(1,23),rep(0,n-23)), n));

quantile(B$mean, (1-conf)/2);
##     2.5% 
## 0.146789
quantile(B$mean, conf+(1-conf)/2);
##    97.5% 
## 0.293578
  2. Can you conclude from the findings above that (a) 20% of first-year University of Calgary students have used marijuana in the past six months? (b) more than 25% of first-year University of Calgary students have used marijuana in the past six months?

Estimating \(\mu\) with Confidence Intervals

Selecting the Margin of Error

Recall that we used the CLT to create the confidence interval formula for proportions. It would be nice to use CLT to create the confidence interval formula for means as well!

Recall also the following formula:

\[Z=\frac{\overline{x}-\mu}{\frac{\sigma}{\sqrt{n}} }\] It would be nice if we could divide by the true standard error, \(\frac{\sigma}{\sqrt{n}}\). The problem is that in real life, we almost never know the value of \(\sigma\), the population standard deviation. In fact, in order to calculate it, we would have to know \(\mu\), which is what we are trying to estimate!

So instead, we replace it with an estimate: the sample standard deviation, \(s\). This gives us an estimate of the standard error: \(\frac{s}{\sqrt{n}}\)

However, \(\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}} }\ne Z\) since we changed the \(\sigma\) to an \(s\). In fact,

\[t=\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}\]

That is, this computation is not a z-score but rather a t-score and does not come from the normal distribution. Instead, it comes from a new distribution called the Student’s t-distribution (or just t-distribution).

Note: the t-distribution was discovered by William Gosset. However, he was working at the Guinness Brewery at the time, which did not allow employees to publish their work. So instead, he published his work under the pen name “Student.”

This t-distribution is a better model for the sampling distribution of \(\overline{x}\) than the normal distribution when \(\sigma\) is not known (that is, when it must be estimated with \(s\)).
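A short simulation can illustrate why the normal distribution is not quite right here. The sketch below assumes samples of size \(n=5\) from a normal population with \(\mu=10\) and \(\sigma=2\) (arbitrary values chosen for illustration) and computes \(\frac{\overline{x}-\mu}{s/\sqrt{n}}\) for each sample.

```r
set.seed(7)

mu <- 10; sigma <- 2; n <- 5   # assumed population values and a small sample size

t_scores <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

# Share of scores beyond +/- 1.96
mean(abs(t_scores) > 1.96)   # observed in the simulation
2 * pnorm(-1.96)             # what the standard normal predicts (about 0.05)
2 * pt(-1.96, df = n - 1)    # what the t-distribution with n - 1 df predicts
```

With such a small sample, the simulated share should land well above 5% and close to the t-based value of roughly 12%, because estimating \(\sigma\) with \(s\) adds extra variability in the tails.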

\[s=\sqrt{\frac{\sum_{i=1}^n(x_i-\overline{x})^2}{n-1}}\text{ and } \sigma=\sqrt{\frac{\sum_{i=1}^N(x_i-\mu)^2}{N}}\]

The t-distribution shares many characteristics with the standard normal distribution. Both are symmetric, unimodal, and might be described as “bell-shaped.”

The t-distribution’s shape depends on only one parameter, called the degrees of freedom (df). The number of degrees of freedom is (usually) an integer: 1, 2, 3, and so on. In this case, the degrees of freedom is the number of gaps in the data or \(n-1\). Ultimately, when the degrees of freedom is infinitely large, the t-distribution is exactly the same as the standard normal distribution.
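The effect of the degrees of freedom can be seen by comparing critical values. The sketch below (the particular df values are arbitrary choices) lists the upper 2.5% critical value of the t-distribution next to the corresponding standard normal value.

```r
df <- c(2, 5, 15, 30, 100, 1000)

# Upper 2.5% critical values of the t-distribution for each df
t_crit <- qt(0.975, df)

# Corresponding standard normal critical value
z_crit <- qnorm(0.975)

data.frame(df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

The t critical values are always larger than the normal one and shrink toward it as df increases.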

Therefore, to create a \((1-\alpha)*100\)% confidence interval for the population mean, \(\mu\) when \(\sigma\) is unknown, we will use the following formula: \[\overline{x}\pm t_{\frac{\alpha}{2},n-1}\frac{s}{\sqrt{n}}\]

where \(t_{\frac{\alpha}{2},n-1}\) denotes the value \(t\) such that \(P(T_{n-1}\geq t)=\frac{\alpha}{2}\)

Remember, if \(\sigma\) is known we can use it (and the standard normal distribution). Also note that if \(n\) is large, \(s\) tends to be close to \(\sigma\), and the t-distribution is close to the standard normal distribution. However, this is only an approximation; it is still best to use t when \(\sigma\) is unknown!

To create a \((1-\alpha)*100\)% confidence interval for the population mean, \(\mu\) when \(\sigma\) is known, we will use the following formula: \[\overline{x}\pm Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}\]

where \(Z_{\frac{\alpha}{2}}\) denotes the value \(z\) such that \(P(Z\geq z)=\frac{\alpha}{2}\)
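The two formulas for \(\mu\) can be sketched as a pair of small R helpers that work from summary statistics. The function names `mean_t_interval` and `mean_z_interval` and the illustrative inputs are hypothetical, introduced only for this example.

```r
# Hypothetical helpers for a confidence interval for mu from summary statistics

# sigma unknown: use s and the t-distribution with n - 1 degrees of freedom
mean_t_interval <- function(xbar, s, n, conf = 0.95) {
  alpha  <- 1 - conf
  margin <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

# sigma known: use sigma and the standard normal distribution
mean_z_interval <- function(xbar, sigma, n, conf = 0.95) {
  alpha  <- 1 - conf
  margin <- qnorm(1 - alpha / 2) * sigma / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

# Illustrative inputs (made up): xbar = 50, standard deviation 10, n = 25
mean_t_interval(50, 10, 25)
mean_z_interval(50, 10, 25)
```

For the same inputs the t-based interval is slightly wider, which reflects the extra uncertainty from estimating \(\sigma\) with \(s\).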

Confidence Intervals Based on Bootstrap Percentiles

Just as was the case for confidence intervals involving proportions, we may end up with a bootstrap distribution that is symmetric but subtly flatter (or steeper) so that more (or less) than 95% of bootstrap statistics are within \(Z_{\frac{\alpha}{2}}\) standard errors of the center.

So we can use the percentiles of the bootstrap distribution to locate the actual middle \((1-\alpha)*100\)% of the distribution. Specifically, if we want the middle \((1-\alpha)*100\)% of the bootstrap distribution (the values that are most likely to be close to the center), we can just chop off the lowest \(\frac{\alpha}{2}*100\)% and highest \(\frac{\alpha}{2}*100\)% of the bootstrap statistics to produce an interval.

Example 4.2.5

One of the exciting aspects of a university professor’s life is the time one spends in meetings. A stratified random sample of 40 professors from various science departments was taken. Each professor was asked, “In a week, how many hours do you typically spend in meetings?” The mean of this sample was \(\overline{x} = 9.85\) hours. Assume that the standard deviation of the number of hours per week spent in meetings for all professors in this particular science faculty is 8 hours, or \(\sigma\) = 8 hours.

  1. Find a 95% confidence interval for \(\mu\), the mean number of hours a professor in this particular science faculty spends in meetings in a week.
mean = 9.85
sigma = 8
n = 40
conf = .95

mean - qnorm(conf+(1-conf)/2)*sigma/n^.5
## [1] 7.37082
mean + qnorm(conf+(1-conf)/2)*sigma/n^.5
## [1] 12.32918
  2. Interpret the meaning of the above interval in the context of the data.
  3. If the level of confidence was increased from 95% to, say, 99% what would happen to the width of the confidence interval?

Example 4.2.6

A study focusing on financial issues and concerns of post-secondary students in Canada was recently conducted by the Royal Bank of Canada. A subset of \(n = 200\) recent graduates from an undergraduate program or diploma was randomly chosen and the debt as a result of going to school (defined as student debt) was determined for each. This produced an average student debt of \(\$26,680\) and a standard deviation of \(\$4,500\). You want to find a 95% confidence interval estimate for \(\mu\), the average level of student debt for all recent graduates from a post-secondary institution (excluding graduate programs).

  1. Find the standard error and the margin of error for 95% confidence.
mean = 26680
s = 4500
n = 200
conf = 0.95

s/n^.5
## [1] 318.1981
qt(conf+(1-conf)/2,n-1)
## [1] 1.971957
qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 627.4727
  2. Find a 95% confidence interval estimate for the average level of student debt for all recent graduates from a post-secondary institution (excluding graduate programs).
mean = 26680
s = 4500
n = 200
conf = 0.95

mean-qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 26052.53
mean+qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 27307.47
  3. Interpret the meaning of the interval calculated above in the context of the data.

Example 4.2.7

The amount of sewage and industrial pollutants dumped into a body of water affects the health of the water by reducing the amount of dissolved oxygen available for aquatic life. Over a two-month period, sixteen samples of water were taken from a river one kilometer downstream from a sewage treatment plant. The amount of dissolved oxygen in each sample of river water was determined and is given below.

\[5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9, 4.7, 4.9, 4.8, 4.9, 5.0, 5.5\]

The mean, median, and the standard deviation of the above sample are given as \(\overline{x} = 5.05\), \(\widetilde{x}= 4.95\), \(s = 0.453\)

  1. Find a 95% confidence interval estimate for \(\mu\), the mean dissolved oxygen level during the two-month period in the river located one-kilometer downstream from the sewage plant. Compute this interval using the appropriate margin of error.
x=c(5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9, 4.7, 4.9, 4.8, 4.9, 5.0, 5.5)
mean=mean(x)
s=sd(x)
n=length(x)
conf=.95

mean-qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 4.80854
mean+qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 5.29146
  2. Find a 95% confidence interval estimate for \(\mu\), the mean dissolved oxygen level during the two-month period in the river located one-kilometer downstream from the sewage plant. Compute this interval based on bootstrapping.
library(mosaic)
B = do(1000)*mean(resample(c(5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9, 4.7, 4.9, 4.8, 4.9, 5.0, 5.5), n));

quantile(B$mean,(1-conf)/2);
##    2.5% 
## 4.83125
quantile(B$mean,((1-conf)/2)+conf);
##    97.5% 
## 5.250156
  3. From the above confidence intervals, could you conclude that \(\mu < 5\)? Which interval is better?

Example 4.2.8

In a report from the Bank of Montreal Outlook on holiday spending for the year 2014, a survey was conducted by Pollara in which 115 Albertans were randomly chosen and each was asked how much they would spend on gifts for people in the upcoming holiday season (excluding amount spent on trips, entertaining, and other spending).

The mean, median, and standard deviation resulting from this survey are: \(\overline{x} = \$652.00, \widetilde{x} = \$643.00, s = \$175\).

  1. From this sample, construct a 95% confidence interval for \(\mu\), the mean amount Albertans spent in the holiday season in 2014.
mean = 652
s = 175
n = 115
conf = 0.95

mean-qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 619.6725
mean+qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 684.3275
  2. The same study revealed that the mean amount spent by all Quebec consumers during the 2014 holiday season was \(\$460\). Does the interval computed above suggest that, on average, Albertans will spend more this holiday season compared to consumers in Quebec?

Which Interval Should We Compute? (A Guide!)

In this set of notes, we’ve learned about three different “structures” of confidence intervals: one that relies on the z-distribution (a z-score), one that relies on a t-distribution (a t-score), and one that is built on bootstrapping the original sample. Which interval should be used and when? The following guide should help you determine when each interval is most appropriate to use.

Intervals for \(\mu\)

Use bootstrapping when…

Use an interval based on the z-distribution when…

Use an interval based on the t-distribution when…

Intervals for \(p\)

Use bootstrapping when…

Use an interval based on the z-distribution when…