MAS 261 - Lecture 13
Intro. to the t Distribution and Confidence Intervals
Housekeeping
Today’s plan:
Comments about Quiz 1
A few minutes for R Questions 🪄
Quick Review of the Sampling Distribution of the Sample Mean
Quick Review of the CLT
Comparing the t distribution to the Z distribution - VERY similar
Comparing the t statistic to the Z score - ALMOST identical
Answering Questions using the t distribution
Introduction to the concept of a Confidence Interval
Interpretation and Calculation of Confidence Intervals
R and RStudio
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access through provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
For those who want to go further with R/RStudio:
If you are interested in downloading R and RStudio to your own computer, I can guide you through the process.
The software is completely free but it does have to be updated a couple times each year.
Lecture 13 In-class Exercises - Q1
Poll Everywhere - My User Name: penelopepoolereisenbies685
Mondays are a slow day at Honeycomb Bakery so they only make 32 plain croissants and hope they won’t run out.
The mean sales number for Mondays is 25 and the standard deviation is 6.
During the month of September, they found that their average Monday sales for four weeks (n = 4) was 33 croissants and they ran out a few times.
If their current estimate of average Monday sales is correct, what is the percent chance of seeing this sales figure?
Sampling Distribution of the Sample Mean
Recall:
If X is an observation from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then X is normally distributed:
\(X\sim N(\mu,\sigma)\)
\(Z = \frac{X-\mu}{\sigma}\)
\(\overline{X}\), the sample mean, is also normally distributed, with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\) (sigma divided by the square root of the sample size):
\(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
\(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
The GOOD NEWS: The sample size adjustment is straightforward to include in the R commands we have covered.
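As a minimal sketch of that adjustment, the review question about Monday croissant sales can be worked in base R. The numbers come from the exercise text; reading the question as an upper-tail probability (an average of 33 or more) is an assumption, and `pnorm` is used here in place of whichever course command you prefer.

```r
# Bakery review question: values taken from the exercise text
mu    <- 25   # hypothesized mean Monday sales
sigma <- 6    # known population standard deviation
n     <- 4    # number of Mondays in the sample
xbar  <- 33   # observed average Monday sales

# Standard deviation of the sample mean (the sample size adjustment)
se <- sigma / sqrt(n)

# Z score for the sample mean
z <- (xbar - mu) / se

# Upper-tail probability: chance of a sample mean of 33 or more
# (ASSUMPTION: the question is read as "33 or more")
p_upper <- pnorm(xbar, mean = mu, sd = se, lower.tail = FALSE)

z
p_upper
```

The only change from the single-observation case is that `sd` is set to `sigma / sqrt(n)` instead of `sigma`.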
Central Limit Theorem (CLT)
The CLT includes a few key concepts.
I highly recommend the video I provided in Lecture 9 that explains these concepts using dragons.
For MAS 261 there is one key concept we will use:
If we have a sample of size 30 or more \((n \geq 30)\), then the sample mean (\(\overline{X}\)) is considered to be normally distributed, even if the population distribution is NOT normal or the population distribution is unknown.
Some internet sources may quibble about the minimum sample size, but 30 is recommended, especially if the distribution is unknown.
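One way to see this idea in action is a small simulation. The sketch below is illustrative only: the exponential population, the seed, and the number of repetitions are all assumptions chosen for the demonstration, not part of the course materials.

```r
set.seed(261)   # arbitrary seed so the simulation is reproducible

n    <- 30      # sample size at the n >= 30 threshold
reps <- 10000   # number of simulated samples (illustrative choice)

# ASSUMED population: exponential with rate 1 (mean 1, sd 1),
# which is strongly right-skewed and clearly NOT normal
xbars <- replicate(reps, mean(rexp(n, rate = 1)))

# Center and spread of the simulated sample means
mean_of_xbars <- mean(xbars)   # compare with the population mean, 1
sd_of_xbars   <- sd(xbars)     # compare with sigma / sqrt(n)

mean_of_xbars
sd_of_xbars
1 / sqrt(n)
```

Running `hist(xbars)` afterwards shows a roughly bell-shaped histogram even though the individual observations are heavily skewed.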
Extending These Two Concepts
If the population distribution is UNKNOWN, we don’t know the true population mean, \(\mu\).
The sample mean, \(\overline{X}\), provides a good estimate, BUT it is unlikely that \(\overline{X} = \mu\) exactly.
- Instead we estimate a likely interval for \(\mu\) called a confidence interval centered at \(\overline{X}\).
We ALSO do not know the population standard deviation, \(\sigma\).
The sample standard deviation, \(S\), provides a good estimate, BUT again, it is unlikely that \(S = \sigma\) exactly.
Substituting \(S\) for \(\sigma\) means we have LESS information about the variability of the distribution.
Substituting \(S\) for \(\sigma\) - t Distribution
Recall that if we KNOW \(\sigma\), the population standard deviation, then we know the distribution of the sample mean, if \(n \geq 30\):
\(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
\(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
If we DON’T KNOW \(\sigma\), we can’t use the Normal distribution.
The t distribution is similar to the Normal (Z) distribution but a little wider.
There is more area in the tails.
Shape (width) determined by degrees of freedom (df)
- df = n - 1, sample size minus 1
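The effect of degrees of freedom on the width can be checked numerically. This is a rough sketch using base R's `qt`, `pt`, `qnorm`, and `pnorm`; the df values and the cutoff of -2 are arbitrary illustrative choices.

```r
df_values <- c(3, 7, 31, 1000)   # illustrative degrees of freedom

# Value that cuts off the upper 2.5% of each distribution
z_crit <- qnorm(0.975)
t_crit <- qt(0.975, df = df_values)

# Area in the lower tail below -2 for each distribution
z_tail <- pnorm(-2)
t_tail <- pt(-2, df = df_values)

# Side-by-side comparison: one row per df value
comparison <- data.frame(df = df_values,
                         t_crit = t_crit, z_crit = z_crit,
                         t_tail = t_tail, z_tail = z_tail)
comparison
```

For every df the t cutoff is larger than the Z cutoff and the t tail area is larger than the Z tail area, and both gaps shrink as df grows.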
t distribution and calculating a t Statistic
If the population standard deviation, \(\sigma\), is known, we can calculate Z and use the normal distribution to answer questions:
\(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
\(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
IF the population standard deviation, \(\sigma\), is unknown:
Instead of Z-score we calculate a t-statistic:
\(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
Format is almost identical to Z, but we substitute S for \(\sigma\)
t DOES NOT tell us how many standard deviations our value is away from the true mean because \(\sigma\) is unknown.
t can help determine the percent chance of seeing the sample mean IF the true population mean is \(\mu\).
Bakery Example from the Review Question
Situation has changed: we know that sales are normally distributed, BUT \(\sigma\) is unknown.
Here are data for Mondays in August and September (n=8):
- In the calculations below, we save the mean and sd and print them to the screen.
Calculating a t Statistic - 2 ways
We use these data to find our t-statistic
The calculation can also be done automatically using the t.test command.
- Must specify what we think the mean, \(\mu\), is
The t.test output includes additional information we will cover in this course.
t Statistic Calculation:
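A minimal sketch of the two approaches, using HYPOTHETICAL Monday sales because the lecture's actual data are not reproduced in these notes. The hypothesized mean of 32 is taken from the exercise that follows; the variable names are illustrative.

```r
# HYPOTHETICAL Monday sales (n = 8); not the lecture's actual data
monday_sales <- c(30, 34, 28, 35, 36, 31, 33, 37)
mu0 <- 32   # hypothesized mean

# Way 1: by hand, from the sample mean and sample standard deviation
xbar <- mean(monday_sales)
s    <- sd(monday_sales)
n    <- length(monday_sales)
t_manual <- (xbar - mu0) / (s / sqrt(n))

# Way 2: let t.test do the same calculation (mu must be specified)
result <- t.test(monday_sales, mu = mu0)
t_auto <- unname(result$statistic)

xbar
s
t_manual
t_auto
```

Both approaches give the same t-statistic; `t.test` simply bundles it with the additional output.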
Interpreting t.test (Some) Output Values
Description of Output:
SKIP FOR NOW:
p-value
alternative hypothesis
Today’s Lecture:
t is the t-statistic we calculated
df is degrees of freedom
df = n - 1 (sample size minus one)
determines width of t distribution
95% confidence interval
- will introduce today
mean of x is the sample mean
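Each of these output values can also be pulled out of the t.test result by name. The sketch below uses HYPOTHETICAL data only to produce a result to inspect; the component names (`statistic`, `parameter`, `conf.int`, `estimate`) are the ones base R's `t.test` returns.

```r
# HYPOTHETICAL sales data, used only to produce a t.test result to inspect
sales  <- c(30, 34, 28, 35, 36, 31, 33, 37)
result <- t.test(sales, mu = 32)

result$statistic   # t: the t-statistic
result$parameter   # df: degrees of freedom, n - 1
result$conf.int    # 95 percent confidence interval (the default level)
result$estimate    # mean of x: the sample mean
```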
Lecture 13 In-class Exercises - Q2
If the true mean sales for Mondays are 32 croissants and the population standard deviation, \(\sigma\), is unknown, what is the percent chance we would see this average sales number or a higher sales average?
- In order to use the vdist_t_prob command, we need the t-statistic.
Lecture 13 In-class Exercises - Q3
For the same bakery, the owner estimates that the true mean sales of their famous apple fritters on weekends is 40 fritters per weekend day.
They bake 42 fritters per weekend day (3.5 dozen). What is the probability they will sell out?
The sample standard deviation is 5.2 based on n = 24 weekend days
- Value of Interest is 42 (amount baked)
- S = 5.2
- n = 24
Note that because we don’t have the data, we have to calculate the t-statistic
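A sketch of that calculation from the summary statistics alone. The values come from the question; base R's `pt` is used here as a stand-in for the course's vdist_t_prob command, whose arguments are not shown in these notes.

```r
mu    <- 40    # owner's estimate of true mean sales per weekend day
value <- 42    # value of interest (fritters baked)
s     <- 5.2   # sample standard deviation
n     <- 24    # number of weekend days in the sample

# t-statistic from summary statistics, same form as the t formula
t_stat <- (value - mu) / (s / sqrt(n))
df     <- n - 1

# Upper-tail area of the t distribution beyond t_stat
p_upper <- pt(t_stat, df = df, lower.tail = FALSE)

t_stat
df
p_upper
```

The t-statistic and df computed this way are the two inputs a t-distribution probability command needs.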
One More Bakery Example With Data
Honeycomb Bakery is also known for their amazing doughnuts.
On weekend mornings they make and sell many different varieties, and a popular kind is the Maple Chocolate Chip doughnut.
Here are their weekend sales for these doughnuts for the past four months (n = 32 weekend days)
- How likely are these data to occur if the true mean sales are 47 doughnuts?
Lecture 13 In-class Exercises - Q4
How likely are these doughnut sales to occur if the true average sales on weekend days are 47 doughnuts?
- Use the t.test command.
- Use the vdist_t_prob command.
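The two routes can be cross-checked against each other. The t.test output for the doughnut data, printed later in these notes, reports t = 3.1536 with df = 31 and a p-value of 0.003569. The sketch below recomputes that probability from the t-statistic with base R's `pt`, used as a stand-in for vdist_t_prob; treating the reported p-value as a two-sided area follows from the "not equal to 47" alternative in the output.

```r
t_stat <- 3.1536   # t reported by t.test for the doughnut data (mu = 47)
df     <- 31       # n - 1 = 32 - 1

# Area in the upper tail beyond the t-statistic
p_upper <- pt(t_stat, df = df, lower.tail = FALSE)

# Two-sided area: both tails, as in the default t.test output
p_two_sided <- 2 * p_upper

p_upper
p_two_sided
```

The two-sided area agrees with the p-value that t.test prints, up to rounding of the t-statistic.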
Introduction to Confidence Intervals
In these examples, we were using a population mean \(\mu\) value based on previous data.
This was what the baker HYPOTHESIZED the mean to be.
In reality, population means are often unknown and the sample mean is only an estimate.
Estimating a population mean with a sample mean is like being blindfolded and shooting an arrow at a tiny, arrow-tip-sized target.
How often will you hit that target if you are blindfolded?
ALMOST NEVER!
Instead, we opt to lower our precision requirements and attempt to estimate an interval that most likely CONTAINS the true mean.
- As part of the interval process, we can also estimate how confident we are that our interval contains the true mean.
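What "how confident we are" means can be illustrated with a simulation in which the true mean is known. This is only a sketch: the population values, seed, and number of repetitions are HYPOTHETICAL choices, and in real problems the true mean is not available to check against.

```r
set.seed(13)   # arbitrary seed so the simulation is reproducible

# HYPOTHETICAL population, chosen only for this demonstration
mu    <- 47     # true mean (known here, unknown in practice)
sigma <- 6      # true standard deviation
n     <- 32     # sample size
reps  <- 5000   # number of simulated samples

# For each simulated sample, build a 95% interval and check
# whether it contains the true mean
contains_mu <- replicate(reps, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

# Fraction of intervals that captured the true mean
coverage <- mean(contains_mu)
coverage
```

The fraction of intervals that contain the true mean comes out close to 0.95, while any single sample mean almost never equals 47 exactly.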
Explanation of Need for Confidence Intervals
Confidence Interval for Doughnuts Data
One Sample t-test
data: doughnuts
t = 3.1536, df = 31, p-value = 0.003569
alternative hypothesis: true mean is not equal to 47
95 percent confidence interval:
48.09293 52.09457
sample estimates:
mean of x
50.09375
We are 95% confident that this interval contains the true mean.
Note that this entire interval falls above 47, the original hypothesized mean.
Based on our data and specified confidence level (95% is default):
we are 95% confident that the interval we estimated (48.09, 52.09) contains the true population mean.
We could increase our confidence level. This would result in a WIDER confidence interval which is less precise.
Most common confidence levels are 80%, 90%, 95%, 99%
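The trade-off between confidence and precision can be sketched for the doughnut data using only the printed t.test output. The standard error below is back-calculated from the printed t-statistic (an assumption that introduces a little rounding error), and the variable names are illustrative.

```r
xbar <- 50.09375   # sample mean from the t.test output
n    <- 32         # number of weekend days
df   <- n - 1

# ASSUMPTION: standard error recovered from the printed output,
# t = (xbar - 47) / se with t = 3.1536, so it is slightly rounded
se <- (xbar - 47) / 3.1536

conf_levels <- c(0.80, 0.90, 0.95, 0.99)
alpha       <- 1 - conf_levels

# Critical t value and margin of error for each confidence level
t_crit <- qt(1 - alpha / 2, df = df)
margin <- t_crit * se

ci_table <- data.frame(level = conf_levels,
                       lower = xbar - margin,
                       upper = xbar + margin,
                       width = 2 * margin)
ci_table
```

The 95% row reproduces the interval in the t.test output, and the width grows steadily from the 80% level to the 99% level.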
Confidence Intervals - Helpful Terminology
How a Confidence Interval is Estimated
Centered at sample mean, \(\overline{X}\).
Interval Lower Bound:
- \(\overline{X}-\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
Interval Upper Bound:
- \(\overline{X}+\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
Margin of Error (E):
- \(\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
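These formulas translate directly into R. The sketch below uses HYPOTHETICAL data (not the lecture's doughnut sales) and checks the hand calculation against the interval reported by `t.test`; `qt` supplies the critical value \(t_{df, 1-\frac{\alpha}{2}}\).

```r
# HYPOTHETICAL data, used only to illustrate the formulas
x     <- c(52, 47, 55, 49, 51, 46, 53, 50, 48, 54)
alpha <- 0.05   # 95% confidence level

xbar <- mean(x)      # center of the interval
s    <- sd(x)        # sample standard deviation, S
n    <- length(x)
df   <- n - 1

# Critical value t_{df, 1 - alpha/2}
t_crit <- qt(1 - alpha / 2, df = df)

# Margin of error and interval bounds
E     <- s / sqrt(n) * t_crit
lower <- xbar - E
upper <- xbar + E

c(lower, upper)

# Same interval from t.test, for comparison
ci_auto <- as.numeric(t.test(x, conf.level = 1 - alpha)$conf.int)
ci_auto
```

Changing `alpha` (for example to 0.01 for a 99% interval) changes only `t_crit`, and therefore the margin of error.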
More about the Margin of Error
CI calculations of Doughnuts Data
Results match t.test output
[1] 2.032837
[1] 48.06091
[1] 52.12659
One Sample t-test
data: doughnuts
t = 51.062, df = 31, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
48.09293 52.09457
sample estimates:
mean of x
50.09375
Key Points from Today
If the population standard deviation (\(\sigma\)) is unknown, we use the t distribution to answer questions
\(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
If we have data, we can use the t.test command
If we only have summary statistics, we use the t formula above
vdist_t_prob helps to visualize the distribution
We estimate confidence intervals to provide information about true mean.
If we estimate a 95% Confidence Interval, we are 95% confident that the estimated interval contains the true, UNKNOWN mean.
- Common levels for Confidence Intervals: 80%, 90%, 95%, 99%
To submit an Engagement Question or Comment about material from Lecture 13: Submit it by midnight today (day of lecture).