Effect of Sample Size and Confidence Level on Confidence Intervals
2024-10-09
Today’s plan 📋
More Comments about Quiz 1
A few minutes for R Questions 🪄
Review of the t distribution
Review of t-test output
Interpretation and Calculation of Confidence Intervals
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
I will post R/RStudio files on Posit Cloud that you can access through the links provided.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
In 2023, 293 of the top 1000 YouTubers were based in the United States.
The global top 1000 YouTubers have an average number of subscribers of 21.89 million, but we don’t know the variability in the global data.
This dataset includes a random sample of 30 of these U.S. YouTubers.
Use t.test and vdist_t_prob to better understand the probability of seeing our observed US sample data.
Find the t-statistic using t.test with \(\mu = 21.89\).
Use vdist_t_prob to find the probability of seeing the US sample mean or a lower mean.
\(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
Format is almost identical to Z, but we substitute S for \(\sigma\)
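As a rough sketch of the formula above, the t-statistic can be computed by hand and compared with what t.test reports. The vector `subs` is made-up illustration data, not the course's yt30 sample, and base R's pt() stands in for the lower-tail probability so that no extra package is needed.

```r
# Hypothetical subscriber counts (millions); NOT the course's yt30 data
subs <- c(12.4, 35.1, 18.9, 22.0, 15.7, 27.3, 19.8, 14.2, 30.5, 16.6)
mu0  <- 21.89                     # global mean quoted in the lecture

n    <- length(subs)
xbar <- mean(subs)
s    <- sd(subs)                  # sample standard deviation S

# t = (xbar - mu) / (S / sqrt(n))
t_manual <- (xbar - mu0) / (s / sqrt(n))

# t.test computes the same statistic when given mu = mu0
t_from_test <- unname(t.test(subs, mu = mu0)$statistic)

# Lower-tail probability P(T <= t_manual) with df = n - 1
p_lower <- pt(t_manual, df = n - 1)

print(c(t_manual = t_manual, t_from_test = t_from_test, p_lower = p_lower))
```

The two t values should agree; with the real data, replace `subs` with yt30$subscribers_mil.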
Interpreting the t.test output:
Recall, I mentioned that we will discuss the p-value and hypothesis in upcoming weeks.
In the last lecture, we began to talk about confidence intervals.
Interpretation of a 95% Confidence Interval:
We are 95% sure (confident) that the interval estimated based on sample data contains the true mean number of subscribers for US YouTubers.
From Example Data: We are 95% confident that the true mean number of subscribers for US YouTubers is between ____ and ____.
Round Answers to two decimal places.
This Interval provides estimated bounds.
We are 95% confident that these estimated confidence bounds capture the true mean.
Each one of these means (in the histogram) is the center point of a confidence interval.
As we see, by random chance, some of these means may be far from the true population mean.
These random outliers result in intervals that may not contain the true population mean.
If we (hypothetically) repeated this sampling procedure 10000 times, and each time we estimated a 95% interval:
About 9500 (.95 * 10000) of these intervals would succeed in capturing the true population mean.
About 500 (.05 * 10000) of these intervals would NOT capture the true mean.
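The repeated-sampling idea can be tried out with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with a mean of 20 and a standard deviation of 10 (both invented for illustration, not estimates from the YouTube data), and the seed is arbitrary.

```r
set.seed(14)                      # arbitrary seed for reproducibility
reps  <- 10000                    # number of hypothetical repeated samples
n     <- 30                       # sample size, as in the lecture
mu    <- 20                       # ASSUMED true mean (illustration only)
sigma <- 10                       # ASSUMED population sd (illustration only)

t_crit <- qt(0.975, df = n - 1)   # critical value for a 95% interval

# For each simulated sample, record whether its interval contains mu
captured <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  E <- t_crit * sd(x) / sqrt(n)   # margin of error for this sample
  (mean(x) - E <= mu) && (mu <= mean(x) + E)
})

coverage <- mean(captured)        # proportion of intervals that capture mu
print(coverage)
```

The printed proportion should land close to 0.95, though not exactly on it, because the simulation itself is subject to random chance.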
Suppose we repeated the sampling procedure for the US YouTube data 9000 times, and each time we calculated a 90% confidence interval.
Based on what we know about confidence intervals and the sampling distribution of the sample mean, how many of these 9000 intervals would NOT contain the true population mean, \(\mu\)?
Centered at sample mean, \(\overline{X}\).
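Interval Lower Bound: \(\overline{X} - E\)
Interval Upper Bound: \(\overline{X} + E\)
Margin of Error (E): \(E = t_{1-\frac{\alpha}{2},\,df} \times \frac{S}{\sqrt{n}}\)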
Results match t.test output
[1] 0.975
[1] 3.778387
[1] 16.52828
[1] 24.08505
One Sample t-test
data: yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
16.52828 24.08505
sample estimates:
mean of x
20.30667
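The four bracketed values printed above (0.975, then 3.778387, 16.52828, and 24.08505) are consistent with a manual calculation of the margin of error and the interval bounds. The sketch below approximates that calculation using only numbers read off the t.test printout. Backing the standard error out of the rounded t statistic is an assumption made here because the raw data are not shown, so the results agree only to about three decimal places.

```r
# Values read off the t.test output above
xbar   <- 20.30667   # sample mean
t_stat <- 10.992     # t statistic for mu = 0 (rounded in the printout)
n      <- 30

# Because the test used mu = 0, t = xbar / (S / sqrt(n)),
# so the standard error can be backed out (approximately, due to rounding)
se <- xbar / t_stat

t_crit <- qt(0.975, df = n - 1)   # 1 - alpha/2 = 0.975, df = 29
E      <- t_crit * se             # margin of error
lower  <- xbar - E                # interval lower bound
upper  <- xbar + E                # interval upper bound

print(round(c(E = E, lower = lower, upper = upper), 3))
```

With the actual data, `sd(yt30$subscribers_mil) / sqrt(30)` would replace the backed-out standard error and remove the rounding discrepancy.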
If we estimate a 95% confidence interval, then the probability that we don’t succeed is \(\alpha = .05\) or 5%.
That 5% is split evenly between the two tails of the distribution
0.025 or 2.5% in each tail
To find the correct t value for a confidence interval:
Find \(\alpha\) and \(1 - \frac{\alpha}{2}\)
Determine the degrees of freedom, \(df = n - 1\)
Example using 95% CI for YouTube data:
\(\alpha = .05\)
\(1 - \frac{\alpha}{2}=1-\frac{.05}{2}=1-.025=.975\)
\(df = n-1 =30-1 = 29\)
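The steps above map directly onto R's qt function. The helper name `ci_t_value` is hypothetical, introduced here only to bundle the two steps; qt itself is base R.

```r
# Hypothetical helper: critical t value for a two-sided confidence interval
ci_t_value <- function(conf_level, n) {
  alpha <- 1 - conf_level         # e.g. 0.05 for a 95% interval
  qt(1 - alpha / 2, df = n - 1)   # quantile at 1 - alpha/2 with df = n - 1
}

# 95% interval for the YouTube sample of n = 30:
# equivalent to qt(0.975, df = 29)
t_95 <- ci_t_value(0.95, n = 30)
print(t_95)
```

Calling the helper with 0.80, 0.90, or 0.99 gives the t values for the other common confidence levels.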
The diagram below shows how the 95% confidence interval relates to the t distribution.
In this course, we advocate for computer calculations instead of old-fashioned paper tables.
BUT this t table is exceptional.
It shows Confidence Levels at the Bottom and \(\frac{\alpha}{2}\) at the top.
It shows the Confidence Interval t value for a number of degrees of freedom.
It can help you check your work as you experiment with using the qt function.
In this course, we can use t.test to find confidence interval bounds, BUT you will also be expected to understand how the interval is constructed.
All Common Intervals: 80%, 90%, 95%, 99%
One Sample t-test
data: yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
80 percent confidence interval:
17.88390 22.72943
sample estimates:
mean of x
20.30667
One Sample t-test
data: yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
17.16767 23.44566
sample estimates:
mean of x
20.30667
One Sample t-test
data: yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
16.52828 24.08505
sample estimates:
mean of x
20.30667
One Sample t-test
data: yt30$subscribers_mil
t = 10.992, df = 29, p-value = 0.000000000007401
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
15.21448 25.39885
sample estimates:
mean of x
20.30667
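Outputs like the four above come from changing the conf.level argument of t.test. The sketch below loops over the common levels and collects the bounds in one table. The vector `subs` is made-up illustration data, not the course's yt30 sample, so its numbers will not match the outputs above.

```r
# Made-up sample (millions of subscribers); NOT the course's yt30 data
subs <- c(12.4, 35.1, 18.9, 22.0, 15.7, 27.3, 19.8, 14.2, 30.5, 16.6)

conf_levels <- c(0.80, 0.90, 0.95, 0.99)

# One row per confidence level: lower bound, upper bound, interval width
ci_table <- t(sapply(conf_levels, function(cl) {
  ci <- t.test(subs, conf.level = cl)$conf.int
  c(level = cl, lower = ci[1], upper = ci[2], width = ci[2] - ci[1])
}))

print(round(ci_table, 3))
```

Every interval is centered at the same sample mean, and the width grows as the confidence level rises.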
A larger sample size leads to more precise information, so the interval becomes NARROWER.
The sample size, n, affects TWO components of E, the margin of error.
As n increases, \(\frac{S}{\sqrt{n}}\) decreases because n is in the denominator.
As n increases, df = n-1 increases, and the t distribution becomes narrower, so the t value used in the interval decreases.
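Both effects of n can be seen side by side in a short calculation. This is a sketch only: the sample standard deviation is held fixed at an invented value of 10, and the sample sizes are hypothetical choices for illustration.

```r
S <- 10                             # ASSUMED sample sd, held fixed for illustration
n <- c(10, 30, 100, 1000)           # hypothetical sample sizes

std_error <- S / sqrt(n)            # shrinks because n is in the denominator
t_crit    <- qt(0.975, df = n - 1)  # shrinks as df = n - 1 grows
E         <- t_crit * std_error     # margin of error for a 95% interval

print(round(data.frame(n, std_error, t_crit, E), 3))
```

In practice the sample standard deviation also changes from sample to sample, so real margins of error will not shrink quite this smoothly.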
In HW 5, we examine samples of n = 31, n = 61, and n = 100 with differing sample standard deviations.
If the population standard deviation (\(\sigma\)) is unknown, we use the t distribution to answer questions.
\(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
If we have data, we use the t.test command to estimate a confidence interval.
If we only have summary statistics and we know the confidence level we want, we can calculate E, the margin of error.
We estimate confidence intervals to provide information about the true mean.
If we estimate a 95% Confidence Interval, we are 95% confident that the estimated interval contains the true UNKNOWN mean.
To submit an Engagement Question or Comment about material from Lecture 14: Submit it by midnight today (day of lecture).