Intro. to the t Distribution and Confidence Intervals
2024-10-07
Today’s plan 📋
Comments about Quiz 1
A few minutes for R Questions 🪄
Quick Review of the Sampling Distribution of the Sample Mean
Quick Review of the CLT
Comparing the t distribution to the Z distribution - VERY similar
Comparing the t statistic to the Z score - ALMOST identical
Answering Questions using the t distribution
Introduction to the concept of a Confidence Interval
Interpretation and Calculation of Confidence Intervals
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
Mondays are slow days at Honeycomb Bakery, so they only make 32 plain croissants and hope they won’t run out.
The mean sales number for Mondays is 25 and the standard deviation is 6.
During the month of September, they found that their average Monday sales for four weeks (n = 4) was 33 croissants and they ran out a few times.
If their current estimate of average Monday sales is correct, what is the percent chance of seeing this sales figure?
Recall:
If X is an observation from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then X is normally distributed:
\(X\sim N(\mu,\sigma)\)
\(Z = \frac{X-\mu}{\sigma}\)
\(\overline{X}\), the sample mean, is also normally distributed, with mean \(\mu\) and standard deviation equal to sigma divided by the square root of the sample size, \(\sigma/\sqrt{n}\):
\(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
\(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
The GOOD NEWS: The sample size adjustment is straightforward to include in the R commands we have covered.
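As a minimal sketch of that adjustment, the croissant question can be worked with `pnorm`. This assumes Monday sales are normally distributed and that the question asks for the chance of a sample mean of 33 or higher; the variable names are illustrative only.

```r
# Croissant example: values taken from the slide
mu    <- 25   # current estimate of true mean Monday sales
sigma <- 6    # population standard deviation
n     <- 4    # number of Mondays in the sample
xbar  <- 33   # observed sample mean

se <- sigma / sqrt(n)          # standard deviation of the sample mean
z  <- (xbar - mu) / se         # Z score for the sample mean

# Upper-tail probability P(Xbar >= 33), computed two equivalent ways
p_from_z    <- pnorm(z, lower.tail = FALSE)
p_from_xbar <- pnorm(xbar, mean = mu, sd = se, lower.tail = FALSE)

round(100 * p_from_z, 2)       # percent chance
```

The only change from a single-observation calculation is that `sigma / sqrt(n)` replaces `sigma`.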
The CLT includes a few key concepts.
I highly recommend the video I provided in Lecture 9 that explains these concepts using dragons.
For MAS 261 there is one key concept we will use:
If we have a sample of size 30 or more \((n \geq 30)\), then the sample mean (\(\overline{X}\)) is considered to be approximately normally distributed, even if the population distribution is NOT normal or the population distribution is unknown.
Some internet sources may quibble about the minimum sample size, but 30 is recommended, especially if the distribution is unknown.
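A small simulation can make this concrete. The sketch below is an illustration, not part of the course files: it assumes a right-skewed exponential population (mean 1, standard deviation 1) chosen only because it is clearly not normal, and the seed and number of repetitions are arbitrary.

```r
set.seed(261)                     # arbitrary seed so the run is reproducible
n    <- 30                        # sample size
reps <- 10000                     # number of simulated samples

# Each sample mean comes from n draws of a right-skewed exponential
# population with mean 1 and standard deviation 1
xbars <- replicate(reps, mean(rexp(n, rate = 1)))

mean(xbars)                       # should be close to the population mean, 1
sd(xbars)                         # should be close to 1 / sqrt(30)
```

Running `hist(xbars)` afterwards should show a roughly bell-shaped histogram even though the population itself is strongly skewed.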
If the population distribution is UNKNOWN, we don’t know the true population mean, \(\mu\).
The sample mean, \(\overline{X}\), provides a good estimate, BUT it is unlikely that \(\overline{X} = \mu\) exactly.
We ALSO do not know the population standard deviation, \(\sigma\).
The sample standard deviation, \(S\), provides a good estimate, BUT again, it is unlikely that \(S = \sigma\) exactly.
Substituting \(S\) for \(\sigma\) means we have LESS information about the variability of the distribution.
Recall that if we KNOW \(\sigma\), the population standard deviation, then we know the distribution of the sample mean, if \(n \geq 30\):
\(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
\(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
If we DON’T KNOW \(\sigma\), we can’t use the Normal distribution.
The t distribution is similar to the Normal (Z) distribution but a little wider.
There is more area in the tails.
Shape (width) determined by degrees of freedom (df)
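The effect of degrees of freedom on the width can be checked with base R's `qt` and `qnorm`. This is a sketch only; the df values below are illustrative choices, not values from the lecture.

```r
p  <- 0.975                        # leaves 2.5% in the upper tail
df <- c(3, 10, 30, 100)            # illustrative degrees of freedom

z_crit <- qnorm(p)                 # Normal (Z) cutoff, about 1.96
t_crit <- qt(p, df)                # t cutoffs, one per df value

data.frame(df = df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))

# Area beyond 2 in the upper tail: larger for t than for Z
t_tail <- pt(2, df, lower.tail = FALSE)
z_tail <- pnorm(2, lower.tail = FALSE)
```

The t cutoffs are always larger than the Z cutoff and shrink toward it as df grows, which is what "a little wider" means in practice.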
If the population standard deviation, \(\sigma\), is known, we can calculate Z and use the normal distribution to answer questions:
\(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
\(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
IF the population standard deviation, \(\sigma\), is unknown:
Instead of Z-score we calculate a t-statistic:
\(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
Format is almost identical to Z, but we substitute S for \(\sigma\)
t DOES NOT tell us how many standard deviations our value is away from the true mean because \(\sigma\) is unknown.
t can help determine the percent chance of seeing the sample mean IF the true population mean is \(\mu\).
The situation has changed: we know that sales are normally distributed, BUT \(\sigma\) is unknown.
Here are data for Mondays in August and September (n=8):
We use these data to find our t-statistic
The calculation can also be done automatically using the t.test command.
Must specify what we think the mean, \(\mu\), is.
t.test output includes additional information we will cover in this course.
t.test (Some) Output Values
Description of Output:
SKIP FOR NOW:
p-value
alternative hypothesis
Today’s Lecture:
t is the t-statistic we calculated
df is degrees of freedom
df = n - 1 (sample size minus one)
determines width of t distribution
95% confidence interval
mean of x is the sample mean
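The output values listed above can also be pulled out of a `t.test` result by name. The sketch below uses HYPOTHETICAL numbers made up for illustration (`x_demo` is not the bakery's data) and an assumed hypothesized mean of 32.

```r
# HYPOTHETICAL data for illustration only (NOT the bakery's sales figures)
x_demo <- c(28, 35, 31, 30, 36, 29, 33, 38)
mu_0   <- 32                              # hypothesized mean

res <- t.test(x_demo, mu = mu_0)

res$statistic    # t: the t-statistic
res$parameter    # df: degrees of freedom, n - 1
res$conf.int     # 95% confidence interval (default level)
res$estimate     # mean of x: the sample mean

# The same t-statistic from the formula, for comparison
t_manual <- (mean(x_demo) - mu_0) / (sd(x_demo) / sqrt(length(x_demo)))
```

Comparing `res$statistic` with `t_manual` is a quick way to confirm that `t.test` is applying the formula from the earlier slide.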
If the true mean sales for Mondays are 32 croissants and the population standard deviation, \(\sigma\), is unknown, what is the percent chance we would see this average sales number or a higher sales average?
To use the vdist_t_prob command, we need the t-statistic.
For the same bakery, the owner estimates that the true mean sales of their famous apple fritters on weekends is 40 fritters per weekend day.
They bake 42 fritters per weekend day (3.5 dozen). What is the probability they will sell out?
The sample standard deviation is 5.2 based on n = 24 weekend days
Note that because we don’t have the data, we have to calculate the t-statistic
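A sketch of that calculation from the summary statistics alone, following the t formula from the earlier slide with 42 as the value compared against the hypothesized mean of 40. Base R's `pt` is used here as a stand-in for the area that `vdist_t_prob` would display; the variable names are illustrative.

```r
mu <- 40     # owner's estimate of true mean sales per weekend day
x  <- 42     # fritters baked per weekend day (the sell-out threshold)
s  <- 5.2    # sample standard deviation
n  <- 24     # number of weekend days in the sample

t_stat <- (x - mu) / (s / sqrt(n))            # t-statistic
df     <- n - 1                               # degrees of freedom

p_upper <- pt(t_stat, df, lower.tail = FALSE) # area to the right of t_stat
round(c(t = t_stat, df = df, p = p_upper), 4)
```

The upper tail is used because the question is about reaching 42 or more.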
Honeycomb Bakery is also known for their amazing doughnuts.
On weekend mornings they make and sell many different varieties, and a popular kind is the Maple Chocolate Chip doughnut.
Here are their weekend sales for these doughnuts for the past four months (n = 32 weekend days)
How likely are these doughnut sales to occur if the true average sales on weekend days are 47 doughnuts?
t.test command
vdist_t_prob command
In these examples, we were using a population mean \(\mu\) value based on previous data.
This was what the baker HYPOTHESIZED the mean to be.
In reality, population means are often unknown and the sample mean is only an estimate.
Estimating a population mean with a sample mean is like being blindfolded and shooting an arrow at a tiny, arrow-tip-sized target.
How often will you hit that target if you are blindfolded?
ALMOST NEVER!
Instead, we opt to lower our precision requirements and attempt to estimate an interval that most likely CONTAINS the true mean.
One Sample t-test
data: doughnuts
t = 3.1536, df = 31, p-value = 0.003569
alternative hypothesis: true mean is not equal to 47
95 percent confidence interval:
48.09293 52.09457
sample estimates:
mean of x
50.09375
We are 95% confident that this interval contains the true mean.
Note that this entire interval falls above 47, the original hypothesized mean.
Based on our data and specified confidence level (95% is default):
we are 95% confident that the interval we estimated (48.09, 52.09) contains the true population mean.
We could increase our confidence level, but this requires a WIDER confidence interval which is less precise.
Most common confidence levels are 80%, 90%, 95%, 99%
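The trade-off between confidence and width shows up directly in the t cutoff, since the margin of error is proportional to it. A minimal sketch using `qt`, with df = 31 taken from the doughnut example (n = 32):

```r
conf_levels <- c(0.80, 0.90, 0.95, 0.99)   # common confidence levels
df          <- 31                          # doughnut example: n = 32

# Each level leaves (1 - level) / 2 in each tail
t_crit <- qt(1 - (1 - conf_levels) / 2, df)

data.frame(level = conf_levels, t_crit = round(t_crit, 3))
```

With the same data, a larger cutoff means a wider interval. In `t.test`, the level is set with the `conf.level` argument.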
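Centered at the sample mean, \(\overline{X}\).
Interval Lower Bound: \(\overline{X} - E\)
Interval Upper Bound: \(\overline{X} + E\)
Margin of Error (E): \(E = t \times \frac{S}{\sqrt{n}}\), where t is the t distribution cutoff for the chosen confidence level with df = n - 1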
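Results match t.test output
S <- sd(doughnuts) # sample sd
n <- length(doughnuts) # sample size, n = 32
t_95 <- qt(.975, n - 1) # df = 31; .025 = .05/2 in each tail
(E <- S/sqrt(n) * t_95) # margin of error: divide by sqrt(n), not sqrt(n - 1); approximately 2.0008
mean(doughnuts) - E # interval lower bound
[1] 48.09293
mean(doughnuts) + E # interval upper bound
[1] 52.09457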
One Sample t-test
data: doughnuts
t = 51.062, df = 31, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
48.09293 52.09457
sample estimates:
mean of x
50.09375
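The interval can also be cross-checked from the printed output alone, without the raw data. This is a sketch under the assumption that the output above comes from `t.test` with its default `mu = 0`, so the printed t is the sample mean divided by the standard error; because the printed t is rounded, the result is only approximate.

```r
xbar  <- 50.09375   # mean of x from the output
t_out <- 51.062     # t-statistic from the output (tested against mu = 0)
df    <- 31         # degrees of freedom from the output

se <- (xbar - 0) / t_out          # standard error S / sqrt(n), recovered from t
E  <- qt(0.975, df) * se          # margin of error for a 95% interval

ci <- c(lower = xbar - E, upper = xbar + E)
round(ci, 2)
```

The bounds should land very close to the 95 percent confidence interval printed by `t.test`.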
If population standard deviation (\(\sigma\)) is unknown we use t distribution to answer questions
\(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
If we have data, we can use the t.test command
If we only have summary statistics, we use the t formula above
vdist_t_prob helps to visualize the distribution
We estimate confidence intervals to provide information about true mean.
If we estimate a 95% Confidence Interval, we are 95% confident that the estimated interval contains the true, UNKNOWN mean.
To submit an Engagement Question or Comment about material from Lecture 13: Submit it by midnight today (day of lecture).