MAS 261 - Lecture 13

Intro. to the t Distribution and Confidence Intervals

Penelope Pooler Eisenbies

2024-10-07

Housekeeping

Today’s plan 📋
Comments about Quiz 1
A few minutes for R Questions 🪄
Quick Review of the Sampling Distribution of the Sample Mean
Quick Review of the CLT
Comparing the t distribution to the Z distribution - VERY similar
Comparing the t statistic to the Z score - ALMOST identical
Answering Questions using the t distribution
Introduction to the concept of a Confidence Interval
Interpretation and Calculation of Confidence Intervals

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I will demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

💥 Lecture 13 In-class Exercises - Q1 💥

Mondays are slow days at Honeycomb Bakery, so they only make 32 plain croissants and hope they won’t run out.

The mean sales number for Mondays is 25 and the standard deviation is 6.

During the month of September, they found that their average Monday sales for four weeks (n = 4) was 33 croissants and they ran out a few times.

If their current estimate of average Monday sales is correct, what is the percent chance of seeing a four-week average this high or higher?

Sampling Distribution of the Sample Mean

Recall:

If X is an observation from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then X is normally distributed:
- \(X\sim N(\mu,\sigma)\)
- \(Z = \frac{X-\mu}{\sigma}\)
\(\overline{X}\), the sample mean, is also normally distributed, with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\) (sigma divided by the square root of the sample size):
- \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
- \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
The GOOD NEWS: The sample size adjustment is straightforward to include in the R commands we have covered.
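
As a minimal sketch of that adjustment, the croissant review question can be answered with base R's pnorm by dividing \(\sigma\) by \(\sqrt{n}\). The variable names are illustrative (not from the lecture files), and the sketch assumes the question asks for the chance of a four-week average of 33 or higher.

```r
mu    <- 25   # hypothesized mean Monday sales
sigma <- 6    # population standard deviation
n     <- 4    # number of Mondays averaged
xbar  <- 33   # observed four-week average

se <- sigma / sqrt(n)     # standard deviation of the sample mean
z  <- (xbar - mu) / se    # Z score for the sample mean

# Upper-tail probability of an average this high or higher
p_upper <- pnorm(xbar, mean = mu, sd = se, lower.tail = FALSE)
z
p_upper
```

The only change from a single-observation calculation is that sd is set to sigma / sqrt(n) instead of sigma.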

Central Limit Theorem (CLT)

The CLT includes a few key concepts.
I highly recommend the video I provided in Lecture 9 that explains these concepts using dragons.
For MAS 261 there is one key concept we will use:
If we have a sample of size 30 or more \((n \geq 30)\), then the sample mean (\(\overline{X}\)) is considered to be normally distributed, even if the population distribution is NOT normal or the population distribution is unknown.
Some internet sources may quibble about the minimum sample size, but 30 is recommended, especially if the distribution is unknown.
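
The key concept can be checked with a short simulation. This is only a sketch: the exponential population, the seed, and the number of repetitions are assumptions chosen for illustration, not part of the course materials.

```r
set.seed(261)      # arbitrary seed, for reproducibility only
n    <- 30         # sample size
reps <- 10000      # number of simulated samples

# Exponential population with rate 1: mean 1, sd 1, strongly right-skewed
xbars <- replicate(reps, mean(rexp(n, rate = 1)))

mean(xbars)        # should be close to the population mean, 1
sd(xbars)          # should be close to sigma/sqrt(n) = 1/sqrt(30)
```

Plotting hist(xbars) should show a roughly bell-shaped histogram even though the population itself is far from normal.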

Extending These Two Concepts

If the population distribution is UNKNOWN, we don’t know the true population mean, \(\mu\).
The sample mean, \(\overline{X}\), provides a good estimate, BUT it is unlikely that \(\overline{X} = \mu\) exactly.
- Instead we estimate a likely interval for \(\mu\) called a confidence interval centered at \(\overline{X}\).
We ALSO do not know the population standard deviation, \(\sigma\).
- The sample standard deviation, \(S\), provides a good estimate, BUT again, it is unlikely that \(S = \sigma\) exactly.
- Substituting \(S\) for \(\sigma\) means we have LESS information about the variability of the distribution.

Substituting \(S\) for \(\sigma\) - t Distribution

Recall that if we KNOW \(\sigma\), the population standard deviation, then we know the distribution of the sample mean, if \(n \geq 30\):
- \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
- \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
If we DON’T KNOW \(\sigma\), we can’t use the Normal distribution.
- The t distribution is similar to the Normal (Z) distribution but a little wider.
- There is more area in the tails.
- Shape (width) determined by degrees of freedom (df)
  - df = n - 1
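
To see "a little wider" in numbers, the sketch below compares cutoffs and tail areas from base R's qt/pt with qnorm/pnorm. The df values are arbitrary examples chosen for illustration.

```r
df_values <- c(3, 7, 31, 1000)        # example degrees of freedom
t_cut <- qt(0.975, df = df_values)    # t cutoff leaving 2.5% in the upper tail
z_cut <- qnorm(0.975)                 # matching Z cutoff

data.frame(df = df_values, t_cut = t_cut, z_cut = z_cut)

# Area above 2 under each curve: larger for t, especially at small df
pt(2, df = df_values, lower.tail = FALSE)
pnorm(2, lower.tail = FALSE)
```

The t cutoff is always larger than the Z cutoff and shrinks toward it as df grows, which is why small samples give wider intervals.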

t distribution and calculating a t Statistic

If the population standard deviation, \(\sigma\), is known, we can calculate Z and use the normal distribution to answer questions:
- \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
- \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
IF the population standard deviation, \(\sigma\), is unknown:
- Instead of Z-score we calculate a t-statistic:
- \(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
- Format is almost identical to Z, but we substitute S for \(\sigma\)
- t DOES NOT tell us how many standard deviations our value is away from the true mean because \(\sigma\) is unknown.
- t can help determine the percent chance of seeing the sample mean IF the true population mean is \(\mu\).

Bakery Example from the Review Question

The situation has changed: we know that sales are normally distributed, BUT \(\sigma\) is unknown.
Here are data for Mondays in August and September (n=8):

mnday_sales <- c(31,33,35,32,39,32,29,34)  # data

In the calculations below, we save the mean and sd and print them to the screen.

(mnday_xbar <- mean(mnday_sales))            # calculate and save mean

[1] 33.125

(mnday_sd <- sd(mnday_sales))                # calculate and save sd

[1] 2.997022

Calculating a t Statistic - 2 ways

We use these data to find our t-statistic.
The calculation can also be done automatically using the t.test command.
- Must specify what we think the mean, \(\mu\), is
- t.test output includes additional information we will cover in this course.

t Statistic Calculation:

(t <- (mnday_xbar-32)/(mnday_sd/sqrt(8)))

[1] 1.061714

t.test Command Output:

t.test(mnday_sales, mu=32)


    One Sample t-test

data:  mnday_sales
t = 1.0617, df = 7, p-value = 0.3236
alternative hypothesis: true mean is not equal to 32
95 percent confidence interval:
 30.61943 35.63057
sample estimates:
mean of x 
   33.125

Interpreting `t.test` (Some) Output Values

t.test Command Output:

t.test(mnday_sales, mu=32)


    One Sample t-test

data:  mnday_sales
t = 1.0617, df = 7, p-value = 0.3236
alternative hypothesis: true mean is not equal to 32
95 percent confidence interval:
 30.61943 35.63057
sample estimates:
mean of x 
   33.125

Description of Output:

SKIP FOR NOW:
- p-value
- alternative hypothesis
Today’s Lecture:
t is the t-statistic we calculated
df is degrees of freedom
- df = n - 1 (sample size minus one)
- determines width of t distribution
95% confidence interval
- will introduce today
mean of x is the sample mean

💥 Lecture 13 In-class Exercises - Q2 💥

If the true mean sales for Mondays are 32 croissants and the population standard deviation, \(\sigma\), is unknown, what is the percent chance we would see this average sales number (33.125) or a higher sales average?

In order to use the vdist_t_prob command, we need the t-statistic.
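
One hedged way to check a vdist_t_prob answer is base R's pt, which returns the same kind of upper-tail area without the plot. This sketch recomputes the t-statistic from the Monday data; t_stat and p_upper are illustrative names.

```r
mnday_sales <- c(31, 33, 35, 32, 39, 32, 29, 34)   # Monday sales data (n = 8)
n <- length(mnday_sales)

# t-statistic for a hypothesized mean of 32
t_stat <- (mean(mnday_sales) - 32) / (sd(mnday_sales) / sqrt(n))

# Upper-tail area: chance of this average or higher if the true mean is 32
p_upper <- pt(t_stat, df = n - 1, lower.tail = FALSE)
p_upper
```

The result should be about half of the two-sided p-value (0.3236) in the t.test output, because that p-value counts both tails.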

💥 Lecture 13 In-class Exercises - Q3 💥

For the same bakery, the owner estimates that the true mean sales of their famous apple fritters on weekends is 40 fritters per weekend day.

They bake 42 fritters per weekend day (3.5 dozen). What is the probability they will sell out?

The sample standard deviation is 5.2 based on n = 24 weekend days

Value of Interest is 42 (amount baked)
S = 5.2
n = 24

Note that because we don’t have the data, we have to calculate the t-statistic from the summary values.

t <- (42-40)/(5.2/sqrt(24))
vdist_t_prob(t, df=23, type="upper")

One More Bakery Example With Data

Honeycomb Bakery is also known for their amazing doughnuts.
On weekend mornings they make and sell many different varieties, and a popular kind is the Maple Chocolate Chip doughnut.
Here are their weekend sales for these doughnuts for the past four months (n = 32 weekend days)

doughnuts <- c(59, 41, 56, 47, 39, 55, 56, 52, 45, 49, 53, 48, 61, 55, 60, 48, 46, 48, 47, 47, 42, 57, 50, 55, 49, 48, 49, 53, 48, 45, 43, 52)

How likely are these data to occur if the true mean sales are 47 doughnuts?

💥 Lecture 13 In-class Exercises - Q4 💥

How likely are these doughnut sales to occur if the true average sales on weekend days are 47 doughnuts?

Use t.test command.

t.test(doughnuts, mu=47)

Use vdist_t_prob command

vdist_t_prob(3.1536, df=31, type="upper")

Introduction to Confidence Intervals

In these examples, we were using a population mean \(\mu\) value based on previous data.
This was what the baker HYPOTHESIZED the mean to be.
In reality, population means are often unknown and the sample mean is only an estimate.
Estimating a population mean with a sample mean is like being blindfolded and shooting an arrow at a tiny, arrow-tip-sized target.
- How often will you hit that target if you are blindfolded?
- ALMOST NEVER!
Instead, we opt to lower our precision requirements and attempt to estimate an interval that most likely CONTAINS the true mean.
- As part of the interval process, we can also estimate how confident we are that our interval contains the true mean.
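
What "how confident we are" means can be sketched with a simulation in which the true mean is known. Everything here is an assumption for illustration: the normal population, its mean and sd, the seed, and the number of repetitions. The 95% level is simply the t.test default.

```r
set.seed(13)         # arbitrary seed, for reproducibility only
true_mean <- 50      # assumed "true" mean, known only because we simulate
true_sd   <- 5.5     # assumed population sd
n         <- 32      # sample size for each simulated sample
reps      <- 2000    # number of simulated samples

# For each simulated sample, check whether the 95% CI contains the true mean
covered <- replicate(reps, {
  x  <- rnorm(n, mean = true_mean, sd = true_sd)
  ci <- t.test(x)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covered)        # proportion of intervals that contain the true mean
```

The proportion should land near 0.95: the sample mean almost never hits the true mean exactly, but most of the intervals do contain it.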

Explanation of Need for Confidence Intervals

Confidence Interval for Doughnuts Data

t.test(doughnuts, mu=47)


    One Sample t-test

data:  doughnuts
t = 3.1536, df = 31, p-value = 0.003569
alternative hypothesis: true mean is not equal to 47
95 percent confidence interval:
 48.09293 52.09457
sample estimates:
mean of x 
 50.09375

We are 95% confident that this interval contains the true mean.
Note that this entire interval falls above 47, the original hypothesized mean.
Based on our data and specified confidence level (95% is default):
- we are 95% confident that the interval we estimated (48.09, 52.09) contains the true population mean.
- We could increase our confidence level, but this requires a WIDER confidence interval which is less precise.
- Most common confidence levels are 80%, 90%, 95%, 99%
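
The trade-off between confidence and precision can be sketched by rerunning t.test at each common level using its conf.level argument. The names conf_levels and widths are illustrative.

```r
doughnuts <- c(59, 41, 56, 47, 39, 55, 56, 52, 45, 49, 53, 48, 61, 55, 60, 48,
               46, 48, 47, 47, 42, 57, 50, 55, 49, 48, 49, 53, 48, 45, 43, 52)

conf_levels <- c(0.80, 0.90, 0.95, 0.99)

# Width (upper bound minus lower bound) of the interval at each level
widths <- sapply(conf_levels, function(cl) {
  ci <- t.test(doughnuts, conf.level = cl)$conf.int
  ci[2] - ci[1]
})

data.frame(conf_level = conf_levels, width = widths)
```

Each step up in confidence level gives a wider interval around the same sample mean.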

Confidence Intervals - Helpful Terminology

How a Confidence Interval is Estimated

Centered at sample mean, \(\overline{X}\).
- Interval Lower Bound:
  - \(\overline{X}-\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
- Interval Upper Bound:
  - \(\overline{X}+\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
- Margin of Error (E):
  - \(\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
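
When only summary statistics are available, the three formulas above can be wrapped in a small function. This is a sketch, and ci_from_summary is a hypothetical helper name, not a function from R or any package. The example uses the Monday croissant summary values shown earlier.

```r
# Hypothetical helper: t-based confidence interval from summary statistics
ci_from_summary <- function(xbar, s, n, conf_level = 0.95) {
  alpha  <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)   # critical value t_{df, 1 - alpha/2}
  E      <- s / sqrt(n) * t_crit            # margin of error
  c(lower = xbar - E, upper = xbar + E, margin = E)
}

# Monday croissant summary: mean 33.125, sd 2.997022, n = 8
monday_ci <- ci_from_summary(xbar = 33.125, s = 2.997022, n = 8)
monday_ci
```

The lower and upper bounds should agree with the 95 percent confidence interval in the t.test output for the Monday data (30.61943, 35.63057).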

More about the Margin of Error

CI calculations of Doughnuts Data

Results match t.test output

S <- sd(doughnuts)              # sample sd
n <- length(doughnuts)          # sample size (n = 32)
t_95 <- qt(.975, n - 1)         # df = 31; .025 = .05/2 in each tail
(E <-  S/sqrt(n) * t_95)        # margin of error

[1] 2.000822

(LB <- mean(doughnuts) - E)   # CI Lower Bound

[1] 48.09293

(UB <- mean(doughnuts) + E)   # CI Upper Bound

[1] 52.09457

t.test(doughnuts)             # df is not a t.test argument; mu defaults to 0


    One Sample t-test

data:  doughnuts
t = 51.062, df = 31, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 48.09293 52.09457
sample estimates:
mean of x 
 50.09375

Key Points from Today

If the population standard deviation (\(\sigma\)) is unknown, we use the t distribution to answer questions
- \(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
- If we have data, we can use the t.test command
- If we only have summary statistics, we use the t formula above
- vdist_t_prob helps to visualize the distribution
We estimate confidence intervals to provide information about the true mean.
- If we estimate a 95% Confidence Interval, we are 95% confident that the estimated interval contains the true UNKNOWN mean.
  - Common levels for Confidence Intervals: 80%, 90%, 95%, 99%

To submit an Engagement Question or Comment about material from Lecture 13: Submit it by midnight today (day of lecture).