Lecture 12 - Intro. to the t Distribution and Confidence Intervals

Penelope Pooler Eisenbies
MAS 261

2023-10-16

Housekeeping

  • Today’s plan 📋

    • Comments about Quiz 1 (limited - one more makeup)

    • A few minutes for R Questions 🪄

    • Quick Review of the Sampling Distribution of the Sample Mean

    • Quick Review of the CLT

    • Comparing the t distribution to the Z distribution - VERY similar

    • Comparing the t statistic to the Z score - ALMOST identical

    • Answering Questions using the t distribution

    • Introduction to the concept of a Confidence Interval

    • Interpretation and Calculation of Confidence Intervals

Review: R and RStudio 🪄

  • Review: You have two options to facilitate your introduction to R and RStudio:

  • If you are comfortable with coding: Start with Option 1, but still sign up for a Posit Cloud account.

    • We will use Posit Cloud for Quizzes.
  • If you are nervous about coding: Choose Option 2.

  • For both options: I can help with download/install issues during office hours.

  • What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.

  • NOTE: We will use R and RStudio in class during MOST lectures

    • You can use either Posit Cloud or your laptop.

💥 Lecture 12 In-class Exercises - Q1 (Review) 💥

Mondays are a slow day at Honeycomb Bakery, so they only make 32 plain croissants and hope they won’t run out.

The mean sales number for Mondays is 25 and the standard deviation is 6.

During the month of September, they found that their average Monday sales for four weeks (n = 4) was 33 croissants and they ran out a few times.


If their current estimate of average Monday sales is correct, what is the percent chance of seeing this sales figure?

Sampling Distribution of the Sample Mean

Recall:

  • If X is an observation from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then X is normally distributed.

    • \(X\sim N(\mu,\sigma)\)

    • \(Z = \frac{X-\mu}{\sigma}\)

  • \(\overline{X}\), the sample mean, is also normally distributed, with mean \(\mu\) and standard deviation equal to sigma divided by the square root of the sample size, \(\sigma/\sqrt{n}\)

    • \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)

    • \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)

  • The GOOD NEWS: The sample size adjustment is straightforward to include in the R commands we have covered.
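
As a sketch of that adjustment, here is the croissant review question worked with base R's pnorm, reading the question as "this average or higher". The only change from a single-observation calculation is that the standard deviation is divided by the square root of n. The variable names are illustrative and not from the lecture.

```r
# Values from the review question
mu    <- 25   # assumed true mean of Monday sales
sigma <- 6    # known population standard deviation
n     <- 4    # number of Mondays averaged
xbar  <- 33   # observed average Monday sales

se <- sigma / sqrt(n)     # standard deviation of the sample mean
z  <- (xbar - mu) / se    # Z score for the sample mean

# Chance of a sample mean of 33 or more if the true mean is 25
p_upper <- pnorm(xbar, mean = mu, sd = se, lower.tail = FALSE)

z
p_upper
```

Z works out to 8/3, about 2.67, so the upper-tail chance is well under 1%.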

CLT

  • The CLT includes a few key concepts, and I highly recommend the video that explains these concepts using dragons.

  • For MAS 261 there is one key concept we will use:

  • If we have a sample of size 30 or more \((n \geq 30)\), then the sample mean (\(\overline{X}\)) is considered to be normally distributed, even if the population distribution is NOT normal or the population distribution is unknown.

  • Some internet sources may quibble about the minimum sample size, but 30 is recommended, especially if the distribution is unknown.
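
The key concept can be checked with a small simulation. This is only a sketch: the exponential population, the seed, and the number of simulated samples are arbitrary choices made for illustration and are not part of the lecture.

```r
set.seed(261)      # arbitrary seed so the run is repeatable
n      <- 30       # sample size at the rule-of-thumb minimum
n_sims <- 10000    # number of simulated samples

# Population: exponential with rate 1 (strongly right-skewed, mean 1, sd 1)
xbars <- replicate(n_sims, mean(rexp(n, rate = 1)))

mean(xbars)        # should be close to the population mean, 1
sd(xbars)          # should be close to 1 / sqrt(30)

# The histogram of sample means looks roughly bell-shaped
hist(xbars, breaks = 40,
     main = "Sample means, n = 30", xlab = "Sample mean")
```

Even though individual observations are far from normal, the sample means cluster symmetrically around the population mean.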

Extending These Two Concepts

  • If the population distribution is UNKNOWN, we don’t know the true population mean, \(\mu\).

  • The sample mean, \(\overline{X}\), provides a good estimate, BUT it is unlikely that \(\overline{X} = \mu\) exactly.

    • Instead we estimate a likely interval for \(\mu\) called a confidence interval centered at \(\overline{X}\).
  • We ALSO do not know the population standard deviation, \(\sigma\).

    • The sample standard deviation, \(S\), provides a good estimate, BUT again, it is unlikely that \(S = \sigma\) exactly.

    • Substituting \(S\) for \(\sigma\) means we have LESS information about the variability of the distribution.

Substituting \(S\) for \(\sigma\) - Use the t Distribution

  • Recall that if we KNOW \(\sigma\), the population standard deviation, then we know the distribution of the sample mean if \(n \geq 30\):

    • \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)

    • \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)

  • If we DON’T KNOW \(\sigma\), we can’t use the Normal distribution.

    • The t distribution is similar to the Normal (Z) distribution but a little wider.

    • There is more area in the tails.

    • Shape (width) determined by degrees of freedom (df)

      • df = n - 1
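
To see "a little wider" in numbers, the sketch below compares cutoff values from base R's qnorm and qt. The degrees of freedom 7 and 31 match the two bakery data sets in this lecture; 1000 is an arbitrary large value added for comparison.

```r
conf_q <- 0.975                            # leaves 2.5% in the upper tail

z_cut <- qnorm(conf_q)                     # Normal (Z) cutoff
t_cut <- qt(conf_q, df = c(7, 31, 1000))   # t cutoffs for three df values
names(t_cut) <- c("df=7", "df=31", "df=1000")

z_cut
t_cut

# Area to the right of 2: larger for t than for Z (more area in the tails)
pnorm(2, lower.tail = FALSE)
pt(2, df = 7, lower.tail = FALSE)
```

The t cutoffs are always larger than the Z cutoff and shrink toward it as df grows.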

t distribution and calculating a t Statistic

  • If the population standard deviation, \(\sigma\), is known, we can calculate Z and use the normal distribution to answer questions:

    • \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)

    • \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)

  • IF the population standard deviation, \(\sigma\), is unknown:

    • Instead of Z-score we calculate a t-statistic:

    • \(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)

    • Format is almost identical to Z, but we substitute S for \(\sigma\)

    • t DOES NOT tell us how many standard deviations our value is away from the true mean because \(\sigma\) is unknown.

    • t can help determine the percent chance of seeing the sample mean IF the true population mean is \(\mu\).

Recall the Bakery Example from the Review Question

  • Situation has changed: we know that sales are normally distributed, BUT \(\sigma\) is unknown.

  • Here are data for Mondays in August and September (n=8):

mnday_sales <- c(31,33,35,32,39,32,29,34)  # data


  • In the calculations below, we save the mean and sd and print them to the screen.
(mnday_xbar <- mean(mnday_sales))            # calculate and save mean
[1] 33.125
(mnday_sd <- sd(mnday_sales))                # calculate and save sd
[1] 2.997022

Calculating a t Statistic - 2 ways

  • We use these data to find our t-statistic

  • The calculation can also be done automatically using t.test command

    • Must specify what we think the mean, \(\mu\) is

    • t.test output includes additional information we will cover later in the course.

t Statistic Calculation:

(t <- (mnday_xbar-32)/(mnday_sd/sqrt(8))) 
[1] 1.061714

t.test Command Output:

t.test(mnday_sales, mu=32)                 

    One Sample t-test

data:  mnday_sales
t = 1.0617, df = 7, p-value = 0.3236
alternative hypothesis: true mean is not equal to 32
95 percent confidence interval:
 30.61943 35.63057
sample estimates:
mean of x 
   33.125 

Interpreting t.test (Some) Output Values

t.test Command Output:

t.test(mnday_sales, mu=32)                 

    One Sample t-test

data:  mnday_sales
t = 1.0617, df = 7, p-value = 0.3236
alternative hypothesis: true mean is not equal to 32
95 percent confidence interval:
 30.61943 35.63057
sample estimates:
mean of x 
   33.125 

Description of Output:

  • SKIP FOR NOW:

    • p-value

    • alternative hypothesis


  • t is the t-statistic we calculated

  • df is degrees of freedom

    • df = n - 1 (sample size minus one)

    • determines width of t distribution

  • 95% confidence interval

    • will introduce today
  • mean of x is the sample mean

💥 Lecture 12 In-class Exercises - Q2 💥

If the true mean sales for Mondays are 32 croissants and the population standard deviation, \(\sigma\), is unknown, what is the percent chance we would see this average sales number or a higher sales average?

  • In order to use the vdist_t_prob command, we need the t-statistic.
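
One way to sketch this calculation uses base R's pt for the upper-tail area. The names mu_0, t_stat, and p_upper are illustrative. The commented-out vdist_t_prob call assumes the vistributions package is installed and that the function takes perc, df, and type arguments.

```r
mnday_sales <- c(31, 33, 35, 32, 39, 32, 29, 34)   # Monday data from the slide
mu_0 <- 32                                         # hypothesized true mean
n    <- length(mnday_sales)

# t statistic, same formula as on the earlier slide
t_stat <- (mean(mnday_sales) - mu_0) / (sd(mnday_sales) / sqrt(n))

# Area to the right of t: chance of this average or a higher one
p_upper <- pt(t_stat, df = n - 1, lower.tail = FALSE)

t_stat
p_upper

# Plot of the same area (assumed call; requires the vistributions package):
# vistributions::vdist_t_prob(perc = t_stat, df = n - 1, type = "upper")
```

Because t is positive, this upper-tail area is half of the two-sided p-value of 0.3236 reported by t.test, or about 0.16.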

💥 Lecture 12 In-class Exercises - Q3 💥

For the same bakery, the owner estimates that the true mean sales of their famous apple fritters on weekends is 40 fritters per weekend day.

They bake 42 fritters per weekend day (3.5 dozen). What is the probability they will sell out?

The sample standard deviation is 5.2 based on n = 24 weekend days

  • Value of Interest is 42 (amount baked)
  • S = 5.2
  • n = 24

Note that because we don’t have the data, we can’t use t.test and have to calculate the t-statistic from the summary values.
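
A minimal sketch of this setup follows, using the summary values listed above and base R's pt for the upper-tail area. The variable names are illustrative and not from the lecture.

```r
mu_0  <- 40    # owner's estimate of true mean fritter sales
value <- 42    # value of interest (amount baked)
S     <- 5.2   # sample standard deviation
n     <- 24    # number of weekend days

# t statistic from summary statistics
t_stat <- (value - mu_0) / (S / sqrt(n))

# Area to the right of t with df = n - 1 = 23
p_upper <- pt(t_stat, df = n - 1, lower.tail = FALSE)

t_stat
p_upper
```

The t statistic is about 1.88, which puts the upper-tail area between 2.5% and 5%.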

One More Bakery Example With Data

  • Honeycomb Bakery is also known for their amazing doughnuts.

  • On weekend mornings they make and sell many different varieties, and a popular kind is the Maple Chocolate Chip doughnut.

  • Here are their weekend sales for these doughnuts for the past four months (n = 32 weekend days)

How likely are these data to occur if the true mean sales are 47 doughnuts?

💥 Lecture 12 In-class Exercises - Q4 💥

How likely are these doughnut sales to occur if the true average sales on weekend days are 47 doughnuts?

Step 1. Use t.test command

Step 2. Use vdist_t_prob command
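
The doughnut data are not reproduced in these notes, so the sketch below starts from the t statistic and degrees of freedom that t.test reports for them in this lecture (t = 3.1536, df = 31) and uses base R's pt in place of the plot. The variable names are illustrative.

```r
t_stat   <- 3.1536   # t reported by t.test(doughnuts, mu = 47)
deg_free <- 31       # n - 1 = 32 - 1

# Chance of a sample mean this high or higher if the true mean is 47
p_upper <- pt(t_stat, df = deg_free, lower.tail = FALSE)

p_upper
2 * p_upper          # two-sided version, for comparison with the t.test p-value
```

Doubling the upper-tail area should land very close to the reported p-value of 0.003569, so sales this high would be quite unlikely if the true mean were 47.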

Introduction to Confidence Intervals

  • In these examples, we were using a population mean \(\mu\) value based on previous data.

  • This was what the baker HYPOTHESIZED the mean to be.

  • In reality, population means are often unknown and the sample mean is only an estimate.

  • Estimating a population mean with a sample mean is like being blindfolded and shooting an arrow at a tiny, arrow-tip-sized target.

    • How often will you hit that target if you are blindfolded?

    • ALMOST NEVER!

  • Instead, we opt to lower our precision requirements and attempt to estimate an interval that most likely CONTAINS the true mean.

    • As part of the interval process, we can also estimate how confident we are that our interval contains the true mean.

Explanation of Need for Confidence Intervals

Confidence Interval for Doughnuts Data

t.test(doughnuts, mu=47)

    One Sample t-test

data:  doughnuts
t = 3.1536, df = 31, p-value = 0.003569
alternative hypothesis: true mean is not equal to 47
95 percent confidence interval:
 48.09293 52.09457
sample estimates:
mean of x 
 50.09375 
  • We are 95% confident that this interval contains the true mean.

  • Note that this entire interval falls above 47, the original hypothesized mean.

  • Based on our data and specified confidence level (95% is default):

    • we are 95% confident that the interval we estimated (48.09, 52.09) contains the true population mean.

    • We could increase our confidence level, but this requires a WIDER confidence interval which is less precise.

    • Most common confidence levels are 80%, 90%, 95%, 99%
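
The trade-off between confidence level and width can be sketched with the conf.level argument of t.test. The doughnut data are not reproduced in these notes, so this example uses the Monday croissant data from earlier in the lecture. The names conf_levels and ci_table are illustrative.

```r
mnday_sales <- c(31, 33, 35, 32, 39, 32, 29, 34)   # Monday data from earlier slide
conf_levels <- c(0.80, 0.90, 0.95, 0.99)           # common confidence levels

# One row per confidence level: bounds and width of the interval
ci_table <- t(sapply(conf_levels, function(cl) {
  ci <- t.test(mnday_sales, conf.level = cl)$conf.int
  c(level = cl, lower = ci[1], upper = ci[2], width = ci[2] - ci[1])
}))

ci_table
```

Each step up in confidence level gives a wider interval, and the 95% row should agree with the interval (30.61943, 35.63057) in the earlier t.test output.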

Confidence Intervals - Helpful Terminology

How a Confidence Interval is Estimated

  • Centered at sample mean, \(\overline{X}\).

    • Interval Lower Bound:

      • \(\overline{X}-\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
    • Interval Upper Bound:

      • \(\overline{X}+\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
    • Margin of Error (E):

      • \(\frac{S}{\sqrt{n}}\times t_{df, 1-\frac{\alpha}{2}}\)
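
As a sketch of these three formulas, the code below applies them to the Monday croissant data from earlier in the lecture, with \(\alpha = 0.05\) for a 95% interval. The variable names are illustrative; qt is the base R quantile function for the t distribution.

```r
mnday_sales <- c(31, 33, 35, 32, 39, 32, 29, 34)   # Monday data from earlier slide
n     <- length(mnday_sales)
xbar  <- mean(mnday_sales)    # center of the interval
S     <- sd(mnday_sales)      # sample standard deviation
alpha <- 0.05                 # 1 - confidence level

t_crit <- qt(1 - alpha / 2, df = n - 1)   # t cutoff with df = n - 1
E  <- S / sqrt(n) * t_crit                # margin of error
LB <- xbar - E                            # interval lower bound
UB <- xbar + E                            # interval upper bound

c(lower = LB, upper = UB)
```

These bounds should agree with the 95 percent confidence interval (30.61943, 35.63057) in the earlier t.test output for the same data.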

More about the Margin of Error

CI calculations of Doughnuts Data

Results match t.test output

S <- sd(doughnuts)              # sample sd
t_95 <- qt(.975, 31)            # df = n - 1 = 31; .025 = .05/2 in each tail
E <- S/sqrt(32) * t_95          # margin of error (n = 32), about 2.0008
(LB <- mean(doughnuts) - E)     # CI Lower Bound
[1] 48.09293
(UB <- mean(doughnuts) + E)     # CI Upper Bound
[1] 52.09457
t.test(doughnuts)               # mu defaults to 0; the CI does not depend on mu

    One Sample t-test

data:  doughnuts
t = 51.062, df = 31, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 48.09293 52.09457
sample estimates:
mean of x 
 50.09375 

Key Points from Today

  • If the population standard deviation (\(\sigma\)) is unknown, we use the t distribution to answer questions

    • \(t = \frac{\overline{X}-\mu}{\frac{S}{\sqrt{n}}}\)

    • If we have data, we can use t.test command

    • If we only have summary statistics we use t formula above

    • vdist_t_prob helps to visualize distribution

  • We estimate confidence intervals to provide information about true mean.

    • If we estimate a 95% Confidence Interval, we are 95% confident that the estimated interval contains the true, UNKNOWN mean.

      • Common levels for Confidence Intervals: 80%, 90%, 95%, 99%

To submit an Engagement Question or Comment about material from Lecture 12: Submit by midnight today (day of lecture). Click on the link under Lecture 12.