MAS 261 - Lecture 14

Effect of Sample Size and Confidence Level on Confidence Intervals

Penelope Pooler Eisenbies

2024-10-09

Housekeeping

Today’s plan 📋
More Comments about Quiz 1
A few minutes for R Questions 🪄
Review of the t distribution
- Introduction of the t table (Very useful!)
Review of
- t-test output
- Interpretation and Calculation of Confidence Intervals

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access through the links provided.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I will demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

💥 Lecture 14 In-class Exercises - Q1 💥

In 2023, the top 1000 YouTubers included 293 based in the United States.

The global top 1000 YouTubers have an average number of subscribers of 21.89 million, but we don’t know the variability in the global data.

This dataset includes a random sample of 30 of these U.S. YouTubers.

Use t.test and vdist_t_prob to better understand the probability of seeing our observed US sample data.

Import the data

library(readr)   # provides read_csv
yt30 <- read_csv("data/Youtube_US_30.csv", show_col_types = FALSE)

Find the t-statistic using t.test with \(\mu = 21.89\).
Use vdist_t_prob to find probability of seeing the US sample mean or a lower mean.

Calculating a t statistic Manually

\(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
Format is almost identical to Z, but we substitute S for \(\sigma\)

mn30 <- mean(yt30$subscribers_mil)
sd30 <- sd(yt30$subscribers_mil)
mu <- 21.89
t <- (mn30-mu)/(sd30/sqrt(30))
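
When the data file is not at hand, the same calculation can be sketched from summary values alone. The sketch below is an illustration, not the course solution. It assumes the sample mean (20.30667), the t statistic against 0 (10.992), and df = 29 printed in the t.test output shown in this lecture, and it back-calculates the standard error from them. It uses base R's pt() for the lower-tail probability in place of vdist_t_prob.

```r
# Summary values copied from the t.test output printed in this lecture
xbar   <- 20.30667   # sample mean (millions of subscribers)
t_vs_0 <- 10.992     # printed t statistic for the default test of mu = 0
n      <- 30
mu     <- 21.89      # global average from the exercise

# Assumption: back-calculate S/sqrt(n), because t_vs_0 = xbar / (S/sqrt(n))
se <- xbar / t_vs_0

t_stat  <- (xbar - mu) / se          # t statistic for mu = 21.89
p_lower <- pt(t_stat, df = n - 1)    # P(T <= t_stat), the lower-tail probability

t_stat
p_lower
```

Because the printed t statistic is rounded, results from this sketch may differ slightly in the last digits from what t.test reports on the full data.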

💥 Lecture 14 In-class Exercises - Q2 and Q3 💥

Interpreting the t.test output:

Recall that I mentioned we will discuss the p-value and hypothesis tests in upcoming weeks.

In the last lecture, we began to talk about confidence intervals.
Interpretation of a 95% Confidence Interval:
- We are 95% sure (confident) that the interval estimated based on sample data contains the true mean number of subscribers for US YouTubers.
- From Example Data: We are 95% confident that the true mean number of subscribers for US YouTubers is between ____ and ____.
- Round Answers to two decimal places.

t.test(yt30$subscribers_mil)

What This Interval Tells Us and What It Does Not

This interval provides estimated bounds.
We are 95% confident that these estimated confidence bounds capture the true mean.

We are NOT 95% confident what an individual’s number of subscribers will be.

We are NOT 95% confident what the true mean is (unless we measure the whole population).

We are NOT 95% confident what another country’s YouTubers’ average number of subscribers is.

Why are we only 95% confident?

Let’s say we hypothetically had 10,000 samples of size 30 of this population.
Our histogram of the means might look like this:

More about Why are we only 95% confident?

Each one of these means (in the histogram) is the center point of a confidence interval.
As we see, by random chance, some of these means may be far from the true population mean.
These random outliers result in intervals that may not contain the true population mean.
If we (hypothetically) repeated this sampling procedure 10,000 times, and each time we estimated a 95% interval:
- About 9500 (.95 * 10000) of these intervals would succeed in capturing the true population mean.
- About 500 (.05 * 10000) of these intervals would NOT capture the true mean.
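
The repeated-sampling idea above can be sketched as a small simulation. The population here is hypothetical (normal, mean 20, sd 10), chosen only so that the true mean is known. It is not the YouTube data.

```r
set.seed(261)          # arbitrary seed so the run is reproducible

n_reps     <- 10000    # number of hypothetical samples
n          <- 30       # size of each sample
true_mu    <- 20       # hypothetical population mean (assumption)
true_sigma <- 10       # hypothetical population sd (assumption)

t_crit <- qt(0.975, df = n - 1)   # t value for a 95% interval

# For each sample, build the 95% interval and record whether it captures true_mu
captured <- replicate(n_reps, {
  x <- rnorm(n, mean = true_mu, sd = true_sigma)
  E <- sd(x) / sqrt(n) * t_crit
  (mean(x) - E <= true_mu) && (true_mu <= mean(x) + E)
})

sum(captured)    # number of intervals that captured the true mean
mean(captured)   # proportion captured, expected to be close to 0.95
```

The count will not be exactly 9500 in any one run. It varies by random chance around that value.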

💥 Lecture 14 In-class Exercises - Q4 💥

Suppose we repeated the sampling procedure for the US YouTube data 9000 times, and each time we calculated a 90% confidence interval.

Based on what we know about confidence intervals and the sampling distribution of the sample mean, how many of these 9000 intervals would NOT contain the true population mean, \(\mu\)?

Confidence Intervals - Helpful Terminology

Reminder of How Interval is Calculated

Centered at sample mean, \(\overline{X}\).
- Interval Lower Bound:
  - \(\overline{X} - E = \overline{X}-\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
- Interval Upper Bound:
  - \(\overline{X} + E = \overline{X}+\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
- Margin of Error (E):
  - \(E = \frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)

More about the Margin of Error

CI calculation

Results match t.test output

S <- sd(yt30$subscribers_mil)              # sample sd
alpha <- 1 - .95
1 - alpha/2

[1] 0.975

t_95 <- qt(.975, 29)            #.025 = .05/2 in each tail 
(E <-  S/sqrt(30) * t_95)     # margin of error

[1] 3.778387

(LB <- mean(yt30$subscribers_mil) - E)   # CI Lower Bound

[1] 16.52828

(UB <- mean(yt30$subscribers_mil) + E)   # CI Upper Bound

[1] 24.08505

t.test(yt30$subscribers_mil)


    One Sample t-test

data:  yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 16.52828 24.08505
sample estimates:
mean of x 
 20.30667

The Part You May Get Stuck On: \(\alpha\) and \(1 - \frac{\alpha}{2}\)

If we estimate a 95% confidence interval, then the probability that we don’t succeed is \(\alpha = .05\) or 5%.
That 5% is split evenly between the two tails of the distribution
- 0.025 or 2.5% in each tail
- To find the correct t value for a confidence interval:
  - Find \(\alpha\) and \(1 - \frac{\alpha}{2}\)
  - Determine degrees of freedom, \(df = n - 1\)
- Example using 95% CI for YouTube data:
  - \(\alpha = .05\)
  - \(1 - \frac{\alpha}{2}=1-\frac{.05}{2}=1-.025=.975\)
  - \(df = n-1 =30-1 = 29\)

qt(.975,29)

[1] 2.04523
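
As a quick check on how \(\alpha\) is split, the sketch below uses base R's pt() to confirm that the t value from qt(.975, 29) leaves 2.5% in each tail and 95% in the middle. The variable names are illustrative only.

```r
df     <- 29
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df)    # same value as qt(.975, 29)

lower_tail <- pt(-t_crit, df)                      # area below -t_crit
upper_tail <- pt(t_crit, df, lower.tail = FALSE)   # area above +t_crit
middle     <- pt(t_crit, df) - pt(-t_crit, df)     # area between the two t values

c(lower_tail = lower_tail, middle = middle, upper_tail = upper_tail)
```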

Illustrating \(\alpha\) and \(1 - \frac{\alpha}{2}\)

The diagram below shows how the 95% confidence interval relates to the t distribution.

A Bit of (Old Fashioned) Help

In this course, we advocate for computer calculations instead of old-fashioned paper tables.
- BUT this t table is exceptional.
- It shows Confidence Levels at the Bottom and \(\frac{\alpha}{2}\) at the top.
- It shows the Confidence Interval t value for a number of degrees of freedom
- It can help you check your work as you experiment with using the qt function.
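
A few rows of such a table can be reproduced with qt, which may help when checking your work. This is a minimal sketch. The degrees of freedom chosen for the rows are an arbitrary selection, not the rows of the table shown in class.

```r
conf_pct <- c(80, 90, 95, 99)       # confidence levels, in percent
dfs      <- c(10, 20, 29, 60)       # hypothetical selection of degrees of freedom

# One column per confidence level: t value at 1 - alpha/2 for each df
t_table <- sapply(conf_pct, function(cp) {
  alpha <- 1 - cp / 100
  qt(1 - alpha / 2, dfs)
})
dimnames(t_table) <- list(paste0("df=", dfs), paste0(conf_pct, "%"))

round(t_table, 3)
```

The entry in row df=29 and column 95% should agree with qt(.975, 29).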

In this course, we can use t.test to find confidence interval bounds,
BUT you will also be expected to understand how the interval is constructed:
- How to find the components of the Margin of Error, \(\frac{S}{\sqrt{n}}\) and \(t_{df,1-\frac{\alpha}{2}}\)
- How to find the Confidence Interval Lower Bound, \(\overline{X}-E\), and Upper Bound, \(\overline{X}+E\).

Changing the Confidence Level

All Common Intervals: 80%, 90%, 95%, 99%
- In order to be MORE confident that the interval captures the true mean, the interval must be wider.

t.test(yt30$subscribers_mil, conf.level = .8)


    One Sample t-test

data:  yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
80 percent confidence interval:
 17.88390 22.72943
sample estimates:
mean of x 
 20.30667

t.test(yt30$subscribers_mil, conf.level = .9)


    One Sample t-test

data:  yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 17.16767 23.44566
sample estimates:
mean of x 
 20.30667

t.test(yt30$subscribers_mil, conf.level = .95)


    One Sample t-test

data:  yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 16.52828 24.08505
sample estimates:
mean of x 
 20.30667

t.test(yt30$subscribers_mil, conf.level = .99)


    One Sample t-test

data:  yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
 15.21448 25.39885
sample estimates:
mean of x 
 20.30667
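
The four intervals above can be compared side by side. The sketch below assumes only the sample mean and t statistic printed in the output (20.30667 and 10.992) and back-calculates the standard error from them, so the bounds may differ from the t.test output in the last digits.

```r
xbar <- 20.30667          # sample mean from the t.test output
se   <- xbar / 10.992     # assumption: S/sqrt(n) back-calculated from the printed t
df   <- 29

conf_pct <- c(80, 90, 95, 99)
t_crit   <- qt(1 - (1 - conf_pct / 100) / 2, df)   # t value for each level
E        <- se * t_crit                            # margin of error for each level

# Interval bounds and total width at each confidence level
data.frame(conf_pct, t_crit, E,
           lower = xbar - E, upper = xbar + E, width = 2 * E)
```

Only the t value changes from row to row. The center and the standard error stay the same, so the width grows with the confidence level.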

How does changing the Sample Size change the interval?

A larger sample size leads to more precise information, so the interval becomes NARROWER.
- EXCEPT if the new sample has a much larger sample standard deviation by random chance.
The sample size, n, affects TWO components of E, the margin of error.
As n increases, \(\frac{S}{\sqrt{n}}\) decreases because \(\sqrt{n}\) is in the denominator.
As n increases, df = n-1 increases, the t distribution becomes narrower, and so \(t_{df, 1-\frac{\alpha}{2}}\) decreases.
In HW 5, we examine samples of n = 31, n = 61, and n = 100 with differing sample standard deviations.
- These questions (9 - 14) illustrate how a change in sample size and differences in the sample standard deviation affect the width of the interval.
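
To see both effects of n separately, the sketch below holds the sample standard deviation fixed at a hypothetical value (S = 10, not taken from HW 5) and changes only the sample size.

```r
S <- 10                    # hypothetical sample sd, held fixed (assumption)
n <- c(31, 61, 100)        # sample sizes mentioned for HW 5

se     <- S / sqrt(n)              # first component: shrinks as n grows
t_crit <- qt(0.975, df = n - 1)    # second component: also shrinks as df grows
E      <- se * t_crit              # 95% margin of error

data.frame(n, se, t_crit, E)
```

Most of the narrowing comes from \(\frac{S}{\sqrt{n}}\). The t value changes much less over this range of sample sizes.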

Key Points from Today

If the population standard deviation (\(\sigma\)) is unknown, we use the t distribution to answer questions.
- \(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
- If we have data, we use the t.test command to estimate a confidence interval.
- If we only have summary statistics and we know the confidence level we want, we can calculate E, the margin of error.
We estimate confidence intervals to provide information about the true mean.
- If we estimate a 95% Confidence Interval, we are 95% confident that the estimated interval contains the true UNKNOWN mean.
  - Common levels for Confidence Intervals: 80%, 90%, 95%, 99%
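
For the summary-statistics case, one possible helper is sketched below. The function name ci_from_summary and the example inputs (mean 50, sd 10, n 25) are hypothetical and used only for illustration.

```r
# Hypothetical helper: confidence interval from summary statistics only
ci_from_summary <- function(xbar, s, n, conf_level = 0.95) {
  alpha  <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)   # t value with df = n - 1
  E      <- s / sqrt(n) * t_crit            # margin of error
  c(lower = xbar - E, upper = xbar + E, margin = E)
}

# Example with made-up summary statistics
ci <- ci_from_summary(xbar = 50, s = 10, n = 25)
ci

# A different confidence level only changes the t value used
ci_from_summary(xbar = 50, s = 10, n = 25, conf_level = 0.90)
```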

To submit an Engagement Question or Comment about material from Lecture 14: Submit it by midnight today (day of lecture).