Inference for a normal population

M. Drew LaMar
November 1, 2021

“If all the statisticians in the world were laid head to toe, they wouldn't be able to reach a conclusion.”

- Anonymous

Class Announcements

  • Reading for Wednesday (LAST READING QUIZ): Chapter 11, Inference for a normal population

Class Announcements

How was Halloween?


Recall: Sampling distribution for means

Theorem: If a variable \( Y \) has a normal distribution in a population, then the distribution of sample means \( \bar{Y} \) is also normal.

Theorem: \( Y \sim N(\mu,\sigma^2) \Rightarrow \bar{Y} \sim N(\mu,\sigma_{\bar{Y}}^2) \), where \( \sigma_{\bar{Y}} \) is the standard error of the mean given by

\[ \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}. \]
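A quick simulation can make the theorem concrete. The sketch below assumes a hypothetical normal population (the values of `mu_pop`, `sigma_pop`, and `n_samp` are made up for illustration) and compares the standard deviation of many simulated sample means with \( \sigma/\sqrt{n} \).

```r
set.seed(1)                       # make the simulation reproducible

# Hypothetical population parameters and sample size (illustration only)
mu_pop    <- 10
sigma_pop <- 2
n_samp    <- 25

# Draw 10,000 samples of size n_samp and keep each sample mean
ybar <- replicate(10000, mean(rnorm(n_samp, mean = mu_pop, sd = sigma_pop)))

se_theory <- sigma_pop / sqrt(n_samp)   # sigma / sqrt(n) from the theorem
se_sim    <- sd(ybar)                   # spread of the simulated sample means

c(theory = se_theory, simulated = se_sim)
```

The two numbers should agree closely, and a histogram of `ybar` (for example, `hist(ybar)`) should look approximately bell-shaped around `mu_pop`.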

Recall: Sampling distribution for means

We can create a standard normal deviate from the sampling distribution as follows: \[ Z = \frac{\bar{Y}-\mu}{\sigma_{\bar{Y}}}, \] where \( \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} \).

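Since 95% of the standard normal distribution lies between -1.96 and 1.96, for 95% of samples we have

\[ -1.96 < \frac{\bar{Y}-\mu}{\sigma_{\bar{Y}}} < 1.96 \]

Solving for \( \mu \) gives

\[ \bar{Y} - 1.96\cdot\sigma_{\bar{Y}} < \mu < \bar{Y} + 1.96\cdot\sigma_{\bar{Y}} \]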

Problem: In practice \( \sigma \) is unknown, so we have to use the estimate \( \mathrm{SE}_{\bar{Y}} = s/\sqrt{n} \) instead of \( \sigma_{\bar{Y}} \)!!

The Student's t-distribution

Definition: For \( Y \sim N(\mu,\sigma^2) \), the standard normal deviate
\[ Z = \frac{\bar{Y}-\mu}{\sigma_{\bar{Y}}} \] is normally distributed with mean 0 and standard deviation 1.

Definition: For \( Y \sim N(\mu,\sigma^2) \), the statistic defined by
\[ t = \frac{\bar{Y}-\mu}{\mathrm{SE}_{\bar{Y}}} \] has a Student’s \( t \)-distribution with \( n-1 \) degrees of freedom.

The Student's t-distribution

The \( t \)-distribution has more probability in its tails than the standard normal distribution, since replacing \( \sigma_{\bar{Y}} \) (used in \( Z \)) with its estimate \( \mathrm{SE}_{\bar{Y}} \) (used in \( t \)) gives us more uncertainty.
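To see the heavier tails numerically, the sketch below compares the two-sided tail probability beyond 1.96 under the standard normal with the same probability under \( t \)-distributions of several degrees of freedom. The particular degrees of freedom in `dfs` are arbitrary choices for illustration.

```r
# Two-sided tail probability beyond +/- 1.96 under the standard normal
z_tail <- 2 * pnorm(1.96, lower.tail = FALSE)

# Same tail probability under t-distributions with increasing df
dfs    <- c(2, 4, 10, 30, 100)
t_tail <- 2 * pt(1.96, df = dfs, lower.tail = FALSE)

# Side-by-side comparison: t_tail exceeds z_tail and shrinks toward it as df grows
data.frame(df = dfs, t_tail = t_tail, z_tail = z_tail)
```

The gap is largest for small samples, which is where using 1.96 in place of the proper \( t \) critical value would be most misleading.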

What's with "Student"?

“Guinness brewer William S. Gosset’s work is responsible for inspiring the concept of statistical significance, industrial quality control, efficient design of experiments and, not least of all, consistently great tasting beer.”

- Dan Kopf

“Gosset used a pseudonym [Student] because Guinness prohibited its employees from publishing, following the unauthorized release of some brewing secrets a few years earlier by another employee.”

- Whitlock & Schluter

Critical values of Student's t

Two-tailed:

\[ 2\times \mathrm{Pr[}t_{4} > 2.78\mathrm{]} = 0.05 \]

\[ \mathrm{Crit.\ val.} = t_{0.05(2),4} = 2.78 \]

One-tailed:

\[ \mathrm{Pr[}t_{4} > 2.78\mathrm{]} = 0.025 \]

\[ \mathrm{Crit.\ val.} = t_{0.025(1),4} = 2.78 \]

Summary of functions for t dist.

Name   R command                         Uses
PDF    dt(x, df)                         -
CDF    pt(q, df, lower.tail=TRUE)        -
CCDF   pt(q, df, lower.tail=FALSE)       Compute \( P \)-values
QF     qt(p, df, lower.tail=TRUE)        -
CQF    qt(p, df, lower.tail=FALSE)       Compute critical values

To compute a \( P \)-value for a two-sided test, make sure you use abs on your test statistic and multiply the upper-tail probability by two!!!

pval <- 2*pt(abs(tstat), df, lower.tail=FALSE)
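A small sketch of why abs matters, using a hypothetical negative test statistic (`tstat_demo` and `df_demo` are made-up values, not from any data set in these slides). Without abs, the upper-tail area of a negative statistic is more than one half, so doubling it gives a "probability" greater than 1.

```r
# Hypothetical test statistic and degrees of freedom (illustration only)
tstat_demo <- -2.5
df_demo    <- 10

# Wrong: upper tail of a negative statistic is > 0.5, so doubling exceeds 1
p_wrong <- 2 * pt(tstat_demo, df_demo, lower.tail = FALSE)

# Right: use the absolute value so we always double the smaller tail
p_right <- 2 * pt(abs(tstat_demo), df_demo, lower.tail = FALSE)

c(wrong = p_wrong, right = p_right)
```

Because the \( t \)-distribution is symmetric about zero, `p_right` is the same as doubling the lower-tail area `pt(tstat_demo, df_demo)`.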

Critical values of Student's t - in R


\[ \mathrm{Pr[}t_{4} > 2.78\mathrm{]} = 0.025 \]

\[ \mathrm{Crit.\ val.} = t_{0.025(1),4} = t_{0.05(2),4} = 2.78 \]

qt(0.05/2, 
   df=4, 
   lower.tail=FALSE)
[1] 2.776445
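It may help to compare the one-tailed and two-tailed critical values at the same \( \alpha \). The sketch below uses \( \alpha = 0.05 \) and 4 degrees of freedom; the variable names (`alpha`, `df_demo`, `crit_two`, `crit_one`) are just illustrative.

```r
alpha   <- 0.05
df_demo <- 4

# Two-tailed critical value: alpha is split evenly between both tails
crit_two <- qt(alpha / 2, df = df_demo, lower.tail = FALSE)

# One-tailed critical value: all of alpha sits in the upper tail
crit_one <- qt(alpha, df = df_demo, lower.tail = FALSE)

c(two_tailed = crit_two, one_tailed = crit_one)

# Round trip: pt() recovers the tail area that qt() was given
pt(crit_two, df = df_demo, lower.tail = FALSE)   # alpha / 2
pt(crit_one, df = df_demo, lower.tail = FALSE)   # alpha
```

The two-tailed value is about 2.78 and the one-tailed value is about 2.13, so a one-tailed test at the same \( \alpha \) rejects for less extreme statistics in the hypothesized direction.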

Critical values of Student's t - Table C

Wait, what are we doing?

Revisiting estimation of the mean!!

Two sides of our statistical coin:

  • Estimation: 95% confidence interval for the mean
  • Hypothesis testing: One-sample \( t \)-test

Assumptions:

  1. The variable of interest is normally distributed in the population.
  2. Data are a random sample from population.

Remember, though, that with a large enough sample the Central Limit Theorem makes our results somewhat robust to violations of Assumption #1.

Estimation: Confidence interval for the mean (revisited)

Definition: The 95% confidence interval for the mean is given by

\[ \overline{Y} - t_{0.05(2),df}\mathrm{SE}_{\overline{Y}} < \mu < \overline{Y} + t_{0.05(2),df}\mathrm{SE}_{\overline{Y}}. \]

This comes from the fact that, for 95% of samples, \[ -t_{0.05(2),df} < t_{df} = \frac{\overline{Y}-\mu}{\mathrm{SE}_{\overline{Y}}} < t_{0.05(2),df}, \] where \( df = n-1 \).
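The definition translates directly into a few lines of R. The helper `mean_ci` below is a hypothetical function written for illustration (it is not part of base R), and `y_demo` is simulated data, not a real data set. As a sanity check, the result is compared with the interval reported by `t.test`.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(y, conf.level = 0.95) {
  n     <- length(y)
  se    <- sd(y) / sqrt(n)                       # standard error of the mean
  alpha <- 1 - conf.level
  tcrit <- qt(alpha / 2, df = n - 1, lower.tail = FALSE)
  c(lower = mean(y) - tcrit * se,
    upper = mean(y) + tcrit * se)
}

set.seed(2)
y_demo <- rnorm(12, mean = 5, sd = 1.5)   # simulated data (illustration only)

mean_ci(y_demo)             # by hand
t.test(y_demo)$conf.int     # built-in, should match
```

Changing `conf.level` (say, to 0.99) widens the interval because the critical value grows while the standard error stays the same.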

Estimation: Confidence interval for the mean (revisited)

Chapter 11, Practice Problem #1

Consider the changes in highest elevation for 31 taxa, in meters, over the late 1900s and early 2000s. Positive and negative numbers indicate upward and downward shifts in elevation, respectively.

Practice Problem #1

str(myData)
'data.frame':   31 obs. of  2 variables:
 $ elevationalRangeShift: num  58.9 7.8 108.6 44.8 11.1 ...
 $ taxonAndLocation     : chr  "moths_Malaysia" "butterflies_Czech" "butterflies_Spain" "butterflies_UK" ...
elevation <- myData$elevationalRangeShift
summary(elevation)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -19.30   15.95   35.80   39.33   63.45  108.60 

Practice Problem #1

Some statistics - Computing standard error

n <- length(elevation)
mu <- mean(elevation)
sdev <- sd(elevation)
sderr <- sdev/sqrt(n)
matrix(c(n, mu, sdev, sderr), nrow=1, byrow=TRUE, dimnames=list("",c("Length","Mean","Sd","Sd err")))
 Length     Mean       Sd   Sd err
     31 39.32903 30.66312 5.507259

Practice Problem #1

95% confidence interval

(tcrit <- qt(0.025, df=n-1, lower.tail=FALSE))
[1] 2.042272
ci <- c(mu - tcrit*sderr, mu + tcrit*sderr)
names(ci) <- c("lower bound", "upper bound")
ci
lower bound upper bound 
   28.08171    50.57635 

Practice Problem #1

99% confidence interval

(tcrit <- qt(0.01/2, df=n-1, lower.tail=FALSE))
[1] 2.749996
ci <- c(mu - tcrit*sderr, mu + tcrit*sderr)
names(ci) <- c("lower bound", "upper bound")
ci
lower bound upper bound 
   24.18410    54.47397 

Hypothesis testing: One-sample t-test

Definition: The one-sample \( t \)-test compares the mean of a random sample from a normal population with the population mean proposed in a null hypothesis.

Null hypothesis: \( H_{0} \): The true mean equals \( \mu_{0} \) (\( \mu = \mu_{0} \))
Alternate hypothesis: \( H_{A} \): The true mean does not equal \( \mu_{0} \) (\( \mu \neq \mu_{0} \))
Test statistic: \[ t = \frac{\overline{Y}-\mu_{0}}{\mathrm{SE}_{\overline{Y}}} \]
Sampling distribution of \( t \) under \( H_{0} \): \( t \)-distribution with \( df = n-1 \)
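The pieces of the test can be put together in a short function. This is a minimal sketch: `one_sample_t` is a hypothetical helper written for illustration, and `y_demo` is simulated data. The result is checked against R's built-in `t.test`.

```r
# Hypothetical helper: two-sided one-sample t-test computed by hand
one_sample_t <- function(y, mu0 = 0) {
  n     <- length(y)
  se    <- sd(y) / sqrt(n)                 # standard error of the mean
  tstat <- (mean(y) - mu0) / se            # test statistic
  pval  <- 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)
  list(t = tstat, df = n - 1, p.value = pval)
}

set.seed(3)
y_demo <- rnorm(15, mean = 1, sd = 2)      # simulated data (illustration only)

res_hand <- one_sample_t(y_demo, mu0 = 0)
res_r    <- t.test(y_demo, mu = 0)

# The hand computation and t.test should agree
c(by_hand = res_hand$t,       t.test = unname(res_r$statistic))
c(by_hand = res_hand$p.value, t.test = res_r$p.value)
```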

Practice Problem #1 (Elevation)

One-sample \( t \)-test

\[ H_{0}: \mu = 0 \\ H_{A}: \mu \neq 0 \]

First, let's calculate a \( P \)-value:

tstat <- (mu - 0)/sderr
(pval <- 2*pt(abs(tstat), df=n-1, lower.tail=FALSE))
[1] 6.056689e-08

Conclusion: With significance level \( \alpha = 0.05 \), since \( P < 0.05 \), we reject the null hypothesis.

Practice Problem #1 (Elevation)

One-sample \( t \)-test

\[ H_{0}: \mu = 0 \\ H_{A}: \mu \neq 0 \]

Second, using critical values:

(tstat)
[1] 7.141309
(tcrit <- qt(0.025, df=n-1, lower.tail=FALSE))
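[1] 2.042272

Conclusion: Since \( |t| = 7.14 \) is greater than the critical value \( t_{0.05(2),30} = 2.04 \), we reject the null hypothesis.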

Practice Problem #1 (Elevation)

One-sample \( t \)-test

\[ H_{0}: \mu = 0 \\ H_{A}: \mu \neq 0 \]

Third, using t.test function:

t.test(elevation, mu=0, conf.level=0.95)

Practice Problem #1 (Elevation)


    One Sample t-test

data:  elevation
t = 7.1413, df = 30, p-value = 6.057e-08
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 28.08171 50.57635
sample estimates:
mean of x 
 39.32903 
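The object returned by `t.test` is a list, so its pieces can be pulled out by name rather than read off the printout. The sketch below uses simulated data (`y_demo` is made up for illustration) and also shows how the confidence interval and the test are two sides of the same coin: a null value just outside the 95% interval is rejected at \( \alpha = 0.05 \), while one just inside is not.

```r
set.seed(4)
y_demo <- rnorm(20, mean = 3, sd = 2)     # simulated data (illustration only)

res <- t.test(y_demo, mu = 0, conf.level = 0.95)

res$statistic    # t statistic
res$parameter    # degrees of freedom
res$p.value      # two-sided P-value
res$conf.int     # confidence interval for the mean
res$estimate     # sample mean

# Test null values just outside and just inside the lower confidence bound
mu0_grid <- c(outside = res$conf.int[1] - 0.1,
              inside  = res$conf.int[1] + 0.1)
p_grid <- sapply(mu0_grid, function(m) t.test(y_demo, mu = m)$p.value)
p_grid           # below 0.05 outside the interval, above 0.05 inside
```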

Estimating population variance

In many cases, it isn't the mean that we are interested in estimating but the variability of a population measure.

Remember, variance is also a population parameter, so we should be able to estimate it.

Stalk-eyed flies have staring contests! Longer-stalked flies usually win.


Estimating population variance

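Definition: If \( Y \) has a normal distribution, then the sampling distribution of the quantity \[ \chi^{2} = \frac{(n-1)s^2}{\sigma^2} \] is the \( \chi^2 \) distribution with \( df = n-1 \) degrees of freedom.

The \( 1-\alpha \) confidence interval for the variance is

\[ \frac{df s^2}{\chi^2_{\alpha/2,df}} < \sigma^2 < \frac{df s^2}{\chi^2_{1-\alpha/2,df}}, \]

where \( \chi^2_{\alpha/2,df} \) is the critical value with area \( \alpha/2 \) to its right, so the larger critical value gives the lower bound.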
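As a sketch of how this interval might be computed, the hypothetical helper `var_ci` below wraps the formula in a function; `y_demo` is simulated data, not the stalk-eyed fly measurements. Taking square roots of the bounds gives an interval for the standard deviation \( \sigma \).

```r
# Hypothetical helper: chi-square-based confidence interval for a variance
var_ci <- function(y, conf.level = 0.95) {
  alpha <- 1 - conf.level
  df    <- length(y) - 1
  s2    <- var(y)
  # Critical values with area alpha/2 to the right (upper) and to the left (lower)
  chi2_upper <- qchisq(alpha / 2, df = df, lower.tail = FALSE)
  chi2_lower <- qchisq(alpha / 2, df = df, lower.tail = TRUE)
  # Dividing by the larger critical value gives the lower bound
  c(lower = df * s2 / chi2_upper,
    upper = df * s2 / chi2_lower)
}

set.seed(5)
y_demo <- rnorm(9, mean = 8.8, sd = 0.4)   # simulated data (illustration only)

ci_var <- var_ci(y_demo)    # interval for the variance
ci_sd  <- sqrt(ci_var)      # interval for the standard deviation
ci_var
ci_sd
```

Note that the interval is not symmetric around the sample variance, because the \( \chi^2 \) distribution is skewed to the right.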


Example 11.2: Stalk-eyed flies

myData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter11/chap11e2Stalkies.csv")
eyespan <- myData$eyespan
summary(eyespan)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.150   8.630   8.690   8.778   8.960   9.450 

Example 11.2: Stalk-eyed flies

svar <- var(eyespan)
df <- length(eyespan) - 1
chi2L <- qchisq(0.025, df=df, lower.tail=TRUE)
chi2U <- qchisq(0.025, df=df, lower.tail=FALSE)
ci <- c(df*svar/chi2U, df*svar/chi2L)
names(ci) <- c("lower bound", "upper bound")
ci
lower bound upper bound 
 0.07238029  0.58225336 

Note: Same assumptions as confidence interval for mean, but much less robust to deviations from these assumptions!!!
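One way to get a feel for this lack of robustness is a coverage simulation. The sketch below is an illustration under stated assumptions, not a general proof: it draws samples of size 20 from a standard normal population and from a right-skewed exponential population (both with true mean or variance chosen so the target is known), then records how often the nominal 95% intervals contain the true value. The helper names `var_covers` and `mean_covers` are hypothetical.

```r
set.seed(6)
n_samp <- 20      # sample size (arbitrary choice for illustration)
n_rep  <- 5000    # number of simulated samples

# Does the 95% chi-square interval for the variance contain true_var?
var_covers <- function(y, true_var) {
  df <- length(y) - 1
  lo <- df * var(y) / qchisq(0.025, df = df, lower.tail = FALSE)
  hi <- df * var(y) / qchisq(0.025, df = df, lower.tail = TRUE)
  lo < true_var && true_var < hi
}

# Does the 95% t interval for the mean contain true_mean?
mean_covers <- function(y, true_mean) {
  ci <- t.test(y)$conf.int
  ci[1] < true_mean && true_mean < ci[2]
}

# Normal population: variance 1. Exponential(rate = 1): mean 1, variance 1.
cov_var_normal <- mean(replicate(n_rep, var_covers(rnorm(n_samp), 1)))
cov_var_exp    <- mean(replicate(n_rep, var_covers(rexp(n_samp, rate = 1), 1)))
cov_mean_exp   <- mean(replicate(n_rep, mean_covers(rexp(n_samp, rate = 1), 1)))

c(variance_normal = cov_var_normal,
  variance_exp    = cov_var_exp,
  mean_exp        = cov_mean_exp)
```

Under normality the variance interval covers close to 95% of the time, while under the skewed population its coverage should fall well short of the nominal level and below that of the interval for the mean.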