Lecture 12 - Intro. to the t Distribution and Confidence Intervals

Penelope Pooler Eisenbies
MAS 261

2023-10-16

Housekeeping

Today’s plan 📋
- Comments about Quiz 1 (limited - one more makeup)
- A few minutes for R Questions 🪄
- Quick Review of the Sampling Distribution of the Sample Mean
- Quick Review of the CLT
- Comparing the t distribution to the Z distribution - VERY similar
- Comparing the t statistic to the Z score - ALMOST identical
- Answering Questions using the t distribution
- Introduction to the concept of a Confidence Interval
- Interpretation and Calculation of Confidence Intervals

Review: R and RStudio 🪄

Review: You have two options to facilitate your introduction to R and RStudio:
- Option 1: Create Posit Cloud account and download and install R and RStudio on your laptop.
- Option 2: Start with a free Posit Cloud account and later transition to using R/RStudio on your laptop.
If you are comfortable with coding: Start with Option 1, but still sign up for Posit Cloud account.
- We will use Posit Cloud for Quizzes.
If you are nervous about coding: Choose Option 2.
For both options: I can help with download/install issues during office hours.
What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.
NOTE: We will use R and RStudio in class during MOST lectures
- You can use either Posit Cloud or your laptop.

💥 Lecture 12 In-class Exercises - Q1 (Review) 💥

Mondays are a slow day at Honeycomb Bakery, so they only make 32 plain croissants and hope they won’t run out.

The mean sales number for Mondays is 25 and the standard deviation is 6.

During the month of September, they found that their average Monday sales for four weeks (n = 4) was 33 croissants and they ran out a few times.

If their current estimate of average Monday sales is correct, what is the percent chance of seeing this sales figure?

Sampling Distribution of the Sample Mean

Recall:

If X is an observation from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then X is normally distributed:
- \(X\sim N(\mu,\sigma)\)
- \(Z = \frac{X-\mu}{\sigma}\)
\(\overline{X}\), the sample mean, is also normally distributed, with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\) (sigma divided by the square root of the sample size):
- \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
- \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
The GOOD NEWS: The sample size adjustment is straightforward to include in the R commands we have covered.
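
As a minimal sketch of that adjustment, the review question can be answered with base R's pnorm by passing \(\sigma/\sqrt{n}\) as the standard deviation. This assumes the question asks for the chance of an average of 33 or higher; the variable names are illustrative and are not taken from the lecture.

```r
# Review question (Q1): values taken from the exercise text
mu    <- 25   # current estimate of mean Monday sales
sigma <- 6    # population standard deviation
n     <- 4    # number of Mondays averaged
xbar  <- 33   # observed average Monday sales

# Standard deviation of the sample mean: sigma / sqrt(n)
se <- sigma / sqrt(n)

# Z score for the sample mean
z <- (xbar - mu) / se

# Upper-tail probability P(Xbar >= 33); ASSUMES "this figure or higher"
p_upper <- pnorm(xbar, mean = mu, sd = se, lower.tail = FALSE)

z
p_upper * 100   # as a percent
```

The Z score is about 2.67, so under these assumptions the chance is well under 1%.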

CLT

The CLT includes a few key concepts, and I highly recommend the video that explains these concepts using dragons.
For MAS 261 there is one key concept we will use:
If we have a sample of size 30 or more \((n \geq 30)\), then the sample mean (\(\overline{X}\)) is considered to be normally distributed, even if the population distribution is NOT normal or the population distribution is unknown.
Some internet sources may quibble about the minimum sample size, but 30 is recommended, especially if the distribution is unknown.
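
A small simulation can make this concrete. The sketch below is an illustration only: the exponential population, the seed, and the number of repetitions are assumptions chosen for the example, not part of the lecture. It draws many samples of size 30 from a skewed population and looks at the distribution of the sample means.

```r
set.seed(261)   # arbitrary seed so results are reproducible

n    <- 30      # sample size from the rule of thumb
reps <- 10000   # number of simulated samples (assumption)

# Draw 'reps' samples of size n from a right-skewed exponential population
# (mean = 1, sd = 1) and keep each sample mean
xbars <- replicate(reps, mean(rexp(n, rate = 1)))

mean(xbars)     # should be close to the population mean, 1
sd(xbars)       # should be close to sigma / sqrt(n)
1 / sqrt(n)     # sigma / sqrt(n) for this population

# Histogram of the sample means: roughly bell shaped despite the skewed population
hist(xbars, breaks = 40,
     main = "Sample means, n = 30, exponential population",
     xlab = "Sample mean")
```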

Extending These Two Concepts

If the population distribution is UNKNOWN, we don’t know the true population mean, \(\mu\).
The sample mean, \(\overline{X}\), provides a good estimate, BUT it is unlikely that \(\overline{X} = \mu\) exactly.
- Instead we estimate a likely interval for \(\mu\) called a confidence interval centered at \(\overline{X}\).
We ALSO do not know the population standard deviation, \(\sigma\).
- The sample standard deviation, \(S\), provides a good estimate, BUT again, it is unlikely that \(S = \sigma\) exactly.
- Substituting \(S\) for \(\sigma\) means we have LESS information about the variability of the distribution.

Substituting \(S\) for \(\sigma\) - Use the t Distribution

Recall that if we KNOW \(\sigma\), the population standard deviation, then we know the distribution of the sample mean if \(n \geq 30\):
- \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
- \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
If we DON’T KNOW \(\sigma\), we can’t use the Normal distribution.
- The t distribution is similar to the Normal (Z) distribution but a little wider.
- There is more area in the tails.
- Shape (width) determined by degrees of freedom (df)
  - df = n - 1
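
The "wider, with more area in the tails" claim can be checked with base R's qt, qnorm, pt, and pnorm. This is a sketch; the degrees of freedom shown are illustrative choices.

```r
# Critical values that leave 2.5% in the upper tail
z_crit <- qnorm(0.975)            # Z distribution
dfs    <- c(3, 7, 31, 100)        # illustrative degrees of freedom (df = n - 1)
t_crit <- qt(0.975, df = dfs)     # t distribution, one value per df
names(t_crit) <- paste0("df=", dfs)

z_crit
t_crit   # each is larger than the Z value, shrinking toward it as df grows

# Area beyond 2 in the upper tail: t has more area in the tail than Z
pnorm(2, lower.tail = FALSE)
pt(2, df = 7, lower.tail = FALSE)
```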

t distribution and calculating a t Statistic

If the population standard deviation, \(\sigma\), is known, we can calculate Z and use the normal distribution to answer questions:
- \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
- \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
IF the population standard deviation, \(\sigma\), is unknown:
- Instead of Z-score we calculate a t-statistic:
- \(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
- Format is almost identical to Z, but we substitute S for \(\sigma\)
- t DOES NOT tell us how many standard deviations our value is away from the true mean because \(\sigma\) is unknown.
- t can help determine the percent chance of seeing the sample mean IF the true population mean is \(\mu\).

Recall the Bakery Example from the Review Question

The situation has changed: we know that sales are normally distributed, BUT \(\sigma\) is unknown.
Here are data for Mondays in August and September (n=8):

mnday_sales <- c(31,33,35,32,39,32,29,34)  # data

In the calculations below, we save the mean and sd and print them to the screen.

(mnday_xbar <- mean(mnday_sales))            # calculate and save mean

[1] 33.125

(mnday_sd <- sd(mnday_sales))                # calculate and save sd

[1] 2.997022

Calculating a t Statistic - 2 ways

We use these data to find our t-statistic
The calculation can also be done automatically using t.test command
- Must specify what we think the mean, \(\mu\) is
- t.test output includes additional information we will cover in a later section of the course.

t Statistic Calculation:

(t <- (mnday_xbar-32)/(mnday_sd/sqrt(8)))

[1] 1.061714

t.test Command Output:

t.test(mnday_sales, mu=32)


    One Sample t-test

data:  mnday_sales
t = 1.0617, df = 7, p-value = 0.3236
alternative hypothesis: true mean is not equal to 32
95 percent confidence interval:
 30.61943 35.63057
sample estimates:
mean of x 
   33.125

Interpreting `t.test` (Some) Output Values


Description of Output:

SKIP FOR NOW:
- p-value
- alternative hypothesis

t is the t-statistic we calculated
df is degrees of freedom
- df = n - 1 (sample size minus one)
- determines width of t distribution
95% confidence interval
- will introduce today
mean of x is the sample mean
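
Each of these values can also be pulled out of the saved t.test result by name. The sketch below uses the Monday sales data from this lecture; the names res and t_manual are illustrative.

```r
mnday_sales <- c(31, 33, 35, 32, 39, 32, 29, 34)  # data from the lecture

res <- t.test(mnday_sales, mu = 32)   # save the result instead of only printing it

res$statistic   # t statistic
res$parameter   # degrees of freedom (n - 1)
res$conf.int    # 95% confidence interval (lower, upper)
res$estimate    # sample mean ("mean of x")

# The saved t matches the manual formula (xbar - mu) / (S / sqrt(n))
t_manual <- (mean(mnday_sales) - 32) / (sd(mnday_sales) / sqrt(length(mnday_sales)))
all.equal(unname(res$statistic), t_manual)
```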

💥 Lecture 12 In-class Exercises - Q2 💥

If the true mean sales for Mondays are 32 croissants and the population standard deviation, \(\sigma\), is unknown, what is the percent chance we would see this average sales number or a higher sales average?

In order to use the vdist_t_prob command, we need the t-statistic.
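
For a numeric check without the plot, the same upper-tail area can be computed with base R's pt. This is a sketch of one way to do it, not the lecture's vdist_t_prob approach; the variable names are illustrative.

```r
mnday_sales <- c(31, 33, 35, 32, 39, 32, 29, 34)  # data from the lecture
n <- length(mnday_sales)

# t statistic for a hypothesized mean of 32
t_stat <- (mean(mnday_sales) - 32) / (sd(mnday_sales) / sqrt(n))

# P(T >= t_stat) with df = n - 1
p_upper <- pt(t_stat, df = n - 1, lower.tail = FALSE)

t_stat
p_upper * 100   # as a percent
```

Because t is positive, this upper-tail area is half of the two-sided p-value that t.test reports (0.3236), or roughly 16%.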

💥 Lecture 12 In-class Exercises - Q3 💥

For the same bakery, the owner estimates that the true mean sales of their famous apple fritters on weekends is 40 fritters per weekend day.

They bake 42 fritters per weekend day (3.5 dozen). What is the probability they will sell out?

The sample standard deviation is 5.2 based on n = 24 weekend days

Value of Interest is 42 (amount baked)
S = 5.2
n = 24

Note that because we don’t have the data, we have to calculate the t-statistic from the summary statistics.
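
A minimal sketch of that calculation, following the setup above (value of interest 42 compared against the hypothesized mean of 40 using \(S/\sqrt{n}\)). The variable names are illustrative, and base R's pt stands in for vdist_t_prob to get the upper-tail area.

```r
mu_hyp <- 40    # owner's estimate of true mean weekend fritter sales
value  <- 42    # value of interest (fritters baked per weekend day)
s      <- 5.2   # sample standard deviation
n      <- 24    # number of weekend days

# t statistic from summary statistics
t_stat <- (value - mu_hyp) / (s / sqrt(n))

# Upper-tail area P(T >= t_stat) with df = n - 1 = 23
p_upper <- pt(t_stat, df = n - 1, lower.tail = FALSE)

t_stat
p_upper
```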

One More Bakery Example With Data

Honeycomb Bakery is also known for their amazing doughnuts.
On weekend mornings they make and sell many different varieties, and a popular kind is the Maple Chocolate Chip doughnut.
Here are their weekend sales for these doughnuts for the past four months (n = 32 weekend days)

How likely are these data to occur if the true mean sales are 47 doughnuts?

💥 Lecture 12 In-class Exercises - Q4 💥

How likely are these doughnut sales to occur if the true average sales on weekend days are 47 doughnuts?

Step 1. Use t.test command.

Step 2. Use vdist_t_prob command.
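
The raw doughnut data are not reproduced in these notes, so Step 1 cannot be rerun here. As a sketch, the two-sided probability can still be recovered from the t statistic and degrees of freedom that the lecture's t.test(doughnuts, mu=47) output reports (t = 3.1536, df = 31), using base R's pt in place of vdist_t_prob.

```r
# Values reported by t.test(doughnuts, mu = 47) in the lecture output
t_stat <- 3.1536
df     <- 31

# Two-sided probability of a t statistic at least this far from 0
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_two_sided
```

This agrees with the p-value in that output (0.003569) up to rounding of t.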

Introduction to Confidence Intervals

In these examples, we were using a population mean \(\mu\) value based on previous data.
This was what the baker HYPOTHESIZED the mean to be.
In reality, population means are often unknown and the sample mean is only an estimate.
Estimating a population mean with a sample mean is like being blindfolded and shooting an arrow at a tiny, arrow-tip-sized target.
- How often will you hit that target if you are blindfolded?
- ALMOST NEVER!
Instead, we opt to lower our precision requirements and attempt to estimate an interval that most likely CONTAINS the true mean.
- As part of the interval process, we can also estimate how confident we are that our interval contains the true mean.
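
One way to see what "confident" means here is a simulation. The sketch below is illustrative only: the true mean, standard deviation, seed, and number of repetitions are all assumptions made up for the example. It draws many samples from a population with a known mean and counts how often the 95% interval from t.test contains that mean.

```r
set.seed(12)        # arbitrary seed for reproducibility

true_mu <- 47       # hypothetical true mean (assumption for the simulation)
sigma   <- 6        # hypothetical population sd (assumption)
n       <- 32       # sample size per simulated sample
reps    <- 2000     # number of simulated samples

# For each simulated sample, check whether the 95% CI contains true_mu
covered <- replicate(reps, {
  x  <- rnorm(n, mean = true_mu, sd = sigma)
  ci <- t.test(x)$conf.int
  ci[1] <= true_mu && true_mu <= ci[2]
})

mean(covered)       # proportion of intervals containing the true mean
```

The proportion should land near 0.95, while the chance that any single sample mean equals the true mean exactly is essentially zero.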

Explanation of Need for Confidence Intervals

Confidence Interval for Doughnuts Data

t.test(doughnuts, mu=47)


    One Sample t-test

data:  doughnuts
t = 3.1536, df = 31, p-value = 0.003569
alternative hypothesis: true mean is not equal to 47
95 percent confidence interval:
 48.09293 52.09457
sample estimates:
mean of x 
 50.09375

We are 95% confident that this interval contains the true mean.
Note that this entire interval falls above 47, the original hypothesized mean.
Based on our data and specified confidence level (95% is default):
- we are 95% confident that the interval we estimated (48.09, 52.09) contains the true population mean.
- We could increase our confidence level, but this requires a WIDER confidence interval which is less precise.
- Most common confidence levels are 80%, 90%, 95%, 99%
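
The trade-off between confidence level and width can be seen with the conf.level argument of t.test. Because the doughnut data are not reproduced in these notes, this sketch uses the Monday croissant data from earlier in the lecture; conf_levels and ci_table are illustrative names.

```r
mnday_sales <- c(31, 33, 35, 32, 39, 32, 29, 34)  # Monday data from earlier in the lecture

conf_levels <- c(0.80, 0.90, 0.95, 0.99)          # common confidence levels

# One row per confidence level: lower bound, upper bound, and width
ci_table <- t(sapply(conf_levels, function(cl) {
  ci <- t.test(mnday_sales, conf.level = cl)$conf.int
  c(level = cl, lower = ci[1], upper = ci[2], width = ci[2] - ci[1])
}))

ci_table   # width grows as the confidence level increases
```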

Confidence Intervals - Helpful Terminology

How a Confidence Interval is Estimated

Centered at sample mean, \(\overline{X}\).
- Interval Lower Bound:
  - \(\overline{X}-\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
- Interval Upper Bound:
  - \(\overline{X}+\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
- Margin of Error (E):
  - \(\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
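
These formulas translate directly into a few lines of R. The helper below, ci_mean, is a hypothetical function written for illustration (it is not from the lecture); it is applied to the Monday croissant data from earlier so the result can be compared with the t.test interval shown for those data (30.61943, 35.63057).

```r
# Hypothetical helper (not from the lecture): t-based CI for a mean
ci_mean <- function(x, conf_level = 0.95) {
  n      <- length(x)
  alpha  <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)   # t_{df, 1 - alpha/2}
  E      <- sd(x) / sqrt(n) * t_crit        # margin of error
  c(lower = mean(x) - E, upper = mean(x) + E, margin = E)
}

mnday_sales <- c(31, 33, 35, 32, 39, 32, 29, 34)  # Monday data from earlier
ci_result <- ci_mean(mnday_sales)
ci_result
```

Note that the standard error divides by \(\sqrt{n}\), while the degrees of freedom passed to qt are \(n - 1\).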

More about the Margin of Error

CI calculations of Doughnuts Data

Results match t.test output

S <- sd(doughnuts)              # sample sd
n <- length(doughnuts)          # sample size (n = 32)
t_95 <- qt(.975, n - 1)         # df = n - 1 = 31; .025 = .05/2 in each tail
E <- S/sqrt(n) * t_95           # margin of error: divide by sqrt(n), not sqrt(df)

(LB <- mean(doughnuts) - E)   # CI Lower Bound

[1] 48.09293

(UB <- mean(doughnuts) + E)   # CI Upper Bound

[1] 52.09457

t.test(doughnuts)               # t.test has no df argument; df = n - 1 is computed from the data


    One Sample t-test

data:  doughnuts
t = 51.062, df = 31, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 48.09293 52.09457
sample estimates:
mean of x 
 50.09375

Key Points from Today

If the population standard deviation (\(\sigma\)) is unknown, we use the t distribution to answer questions.
- \(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)
- If we have data, we can use t.test command
- If we only have summary statistics we use t formula above
- vdist_t_prob helps to visualize distribution
We estimate confidence intervals to provide information about true mean.
- If we estimate a 95% Confidence Interval, we are 95% confident that the estimated interval contains the true UNKNOWN mean.
  - Common levels for Confidence Intervals: 80%, 90%, 95%, 99%

To submit an Engagement Question or Comment about material from Lecture 8: Submit by midnight today (day of lecture). Click on Link next to the ❓ under Lecture 8