M. Drew LaMar
March 15, 2017
“If all the statisticians in the world were laid head to toe, they wouldn't be able to reach a conclusion.”
- Anonymous
Theorem: If a variable \( Y \) has a normal distribution in a population, then the distribution of sample means \( \bar{Y} \) is also normal.
Theorem: \( Y \sim N(\mu,\sigma^2) \Rightarrow \bar{Y} \sim N(\mu,\sigma_{\bar{Y}}^2) \), where \( \sigma_{\bar{Y}} \) is the standard error of the mean given by
\[ \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}. \]
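A quick simulation can make the theorem concrete. This is only a sketch: the population values `mu_pop` and `sigma_pop`, the sample size `n`, and the number of replicates `n_reps` are illustrative choices, not values from the lecture.

```r
# Simulate the sampling distribution of the mean for a normal population.
# All numbers below are illustrative assumptions.
set.seed(1)
mu_pop    <- 10
sigma_pop <- 2
n         <- 25
n_reps    <- 20000

# Each replicate draws n observations and records the sample mean
ybar <- replicate(n_reps, mean(rnorm(n, mean = mu_pop, sd = sigma_pop)))

se_theory <- sigma_pop / sqrt(n)  # standard error from the formula
se_sim    <- sd(ybar)             # spread of the simulated sample means

c(theory = se_theory, simulated = se_sim)
```

The two numbers should agree closely, and a histogram of `ybar` should look normal and centered on `mu_pop`.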
We can create a standard normal deviate from the sampling distribution as follows: \[ Z = \frac{\bar{Y}-\mu}{\sigma_{\bar{Y}}}, \] where \( \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} \).
\[ -1.96 < \frac{\bar{Y}-\mu}{\sigma_{\bar{Y}}} < 1.96 \]
\[ \bar{Y} - 1.96\cdot\sigma_{\bar{Y}} < \mu < \bar{Y} + 1.96\cdot\sigma_{\bar{Y}} \]
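The value 1.96 is the 0.975 quantile of the standard normal, so the inequalities above hold with probability 0.95. A minimal sketch of the resulting interval when \( \sigma \) is known; `ybar`, `sigma`, and `n` are made-up numbers for illustration only.

```r
# Where 1.96 comes from: 2.5% of the standard normal lies above it
z_crit <- qnorm(0.975)

# Probability that Z falls between -z_crit and z_crit
coverage <- pnorm(z_crit) - pnorm(-z_crit)

# Hypothetical sample mean, known population sd, and sample size
ybar  <- 50
sigma <- 10
n     <- 25

sigma_ybar <- sigma / sqrt(n)  # standard error of the mean
ci <- c(lower = ybar - z_crit * sigma_ybar,
        upper = ybar + z_crit * sigma_ybar)
ci
```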
Problem: In practice \( \sigma \) is unknown, so we use \( \mathrm{SE}_{\bar{Y}} = s/\sqrt{n} \) instead of \( \sigma_{\bar{Y}} \)!!
Definition: For \( Y \sim N(\mu,\sigma^2) \), the standard normal deviate
\[ Z = \frac{Y-\mu}{\sigma} \] is normally distributed with mean 0 and standard deviation 1.
Definition: For \( Y \sim N(\mu,\sigma^2) \) and a random sample of size \( n \), the statistic defined by
\[ t = \frac{\bar{Y}-\mu}{\mathrm{SE}_{\bar{Y}}} \] has a Student’s \( t \)-distribution with \( n-1 \) degrees of freedom.
The \( t \)-distribution has more probability in the tails than the standard normal, since approximating \( \sigma_{\bar{Y}} \) (used in \( Z \)) by \( \mathrm{SE}_{\bar{Y}} \) (used in \( t \)) gives us more uncertainty.
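To see the heavier tails numerically, one can compare the two-tailed probability beyond 1.96 under \( t \) and under \( Z \). A small sketch; the set of degrees of freedom in `df_values` is an arbitrary choice.

```r
# Two-tailed probability beyond +/- 1.96 for several degrees of freedom
df_values <- c(2, 4, 10, 30, 100)

tail_t <- 2 * pt(1.96, df = df_values, lower.tail = FALSE)  # Student's t
tail_z <- 2 * pnorm(1.96, lower.tail = FALSE)               # standard normal

data.frame(df = df_values, two_tailed_t = tail_t, two_tailed_z = tail_z)
```

The \( t \) tail probability is always larger than the normal one and shrinks toward it as the degrees of freedom grow.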
“Guinness brewer William S. Gosset’s work is responsible for inspiring the concept of statistical significance, industrial quality control, efficient design of experiments and, not least of all, consistently great tasting beer.”
- Dan Kopf
“Gosset used a pseudonym [Student] because Guinness prohibited its employees from publishing, following the unauthorized release of some brewing secrets a few years earlier by another employee.”
- Whitlock & Schluter
\[ 2\times \mathrm{Pr[}t_{4} > 2.78\mathrm{]} = 0.05 \]
\[ \mathrm{Crit.\ val.} = t_{0.05(2),4} = 2.78 \]
\[ \mathrm{Pr[}t_{4} > 2.78\mathrm{]} = 0.025 \]
\[ \mathrm{Crit.\ val.} = t_{0.025(1),4} \]
Name | R command | Uses |
---|---|---|
PDF | dt(x, df) | - |
CDF | pt(q, df, lower.tail=TRUE) | - |
CCDF | pt(q, df, lower.tail=FALSE) | Compute \( P \)-values |
QF | qt(p, df, lower.tail=TRUE) | - |
CQF | qt(p, df, lower.tail=FALSE) | Compute critical values |
To compute \( P \)-values, make sure you use abs on your test statistic and multiply by two for a two-sided test!!!
pval <- 2*pt(abs(tstat), df, lower.tail=FALSE)
\[ \mathrm{Pr[}t_{4} > 2.78\mathrm{]} = 0.025 \]
\[ \mathrm{Crit.\ val.} = t_{0.025(1),4} = 2.78 \]
qt(0.05/2,
df=4,
lower.tail=FALSE)
[1] 2.776445
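The quantile functions and tail-probability functions undo each other, which is a handy sanity check on a critical value. A short sketch for 4 degrees of freedom; the variable names are just for illustration.

```r
df4 <- 4

# Upper-tail critical values: 2.5% in the tail vs. 5% in the tail
t_two <- qt(0.05 / 2, df = df4, lower.tail = FALSE)  # two-tailed alpha = 0.05
t_one <- qt(0.05,     df = df4, lower.tail = FALSE)  # one-tailed alpha = 0.05

# Round trip: the upper-tail probability at t_two is 0.025 again
p_back <- pt(t_two, df = df4, lower.tail = FALSE)

# Symmetry: the lower-tail quantile is the negative of the upper-tail one
t_lower <- qt(0.05 / 2, df = df4, lower.tail = TRUE)

c(t_two = t_two, t_one = t_one, p_back = p_back, t_lower = t_lower)
```

Note that putting the full 5% in one tail gives a smaller critical value (about 2.13) than splitting it between two tails (about 2.78).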
Revisiting estimation of the mean!!
Two sides of our statistical coin:
Assumptions:
Remember, though, that the Central Limit Theorem can make our results somewhat robust to Assumption #1.
Definition: The 95% confidence interval for the mean is given by
\[ \overline{Y} - t_{0.05(2),df}\mathrm{SE}_{\overline{Y}} < \mu < \overline{Y} + t_{0.05(2),df}\mathrm{SE}_{\overline{Y}}. \]
This comes from \[ -t_{0.05(2),df} < t_{df} = \frac{\overline{Y}-\mu}{\mathrm{SE}_{\overline{Y}}} < t_{0.05(2),df} \]
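The interval can be wrapped in a small helper. This is a sketch, not part of the lecture code: `ci_mean` is a hypothetical function name and the data vector `y` is made up.

```r
# Confidence interval for the mean based on the t-distribution.
# ci_mean is a hypothetical helper; conf_level defaults to 95%.
ci_mean <- function(y, conf_level = 0.95) {
  n     <- length(y)
  se    <- sd(y) / sqrt(n)                               # SE of the mean
  alpha <- 1 - conf_level
  tcrit <- qt(alpha / 2, df = n - 1, lower.tail = FALSE) # t_{alpha(2), df}
  c(lower = mean(y) - tcrit * se, upper = mean(y) + tcrit * se)
}

# Made-up measurements, for illustration only
y <- c(4.1, 5.3, 3.8, 6.0, 5.1, 4.7)
ci_mean(y)
ci_mean(y, conf_level = 0.99)  # a higher confidence level gives a wider interval
```

The result should match the interval reported by R's built-in `t.test(y)$conf.int`.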
Chapter 11, Practice Problem #1
Consider the changes in highest elevation for 31 taxa, in meters, over the late 1900s and early 2000s. Positive and negative numbers will indicate upward and downward shifts in elevation, respectively.
str(myData)
'data.frame': 31 obs. of 2 variables:
$ elevationalRangeShift: num 58.9 7.8 108.6 44.8 11.1 ...
$ taxonAndLocation : Factor w/ 30 levels "aquatic bugs_UK",..: 21 7 8 9 6 1 10 12 13 14 ...
elevation <- myData$elevationalRangeShift
summary(elevation)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-19.30 15.95 35.80 39.33 63.45 108.60
Some statistics
n <- length(elevation)
mu <- mean(elevation)
sdev <- sd(elevation)
sderr <- sdev/sqrt(n)
matrix(c(n, mu, sdev, sderr), nrow=1, byrow=TRUE, dimnames=list("",c("Length","Mean","Sd","Sd err")))
Length Mean Sd Sd err
31 39.32903 30.66312 5.507259
95% confidence interval
(tcrit <- qt(0.025, df=n-1, lower.tail=FALSE))
[1] 2.042272
ci <- c(mu - tcrit*sderr, mu + tcrit*sderr)
names(ci) <- c("lower bound", "upper bound")
ci
lower bound upper bound
28.08171 50.57635
99% confidence interval
(tcrit <- qt(0.01/2, df=n-1, lower.tail=FALSE))
[1] 2.749996
ci <- c(mu - tcrit*sderr, mu + tcrit*sderr)
names(ci) <- c("lower bound", "upper bound")
ci
lower bound upper bound
24.18410 54.47397
Definition: The one-sample \( t \)-test compares the mean of a random sample from a normal population with the population mean proposed in a null hypothesis.
Null hypothesis: \( H_{0} \): The true mean equals \( \mu_{0} \) (\( \mu = \mu_{0} \))
Alternate hypothesis: \( H_{A} \): The true mean does not equal \( \mu_{0} \) (\( \mu \neq \mu_{0} \))
Test statistic:
\[
t = \frac{\overline{Y}-\mu_{0}}{\mathrm{SE}_{\overline{Y}}}
\]
Sampling distribution of \( t \) under \( H_{0} \): \( t \)-distribution with \( df = n-1 \)
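The pieces of the test fit together as in the sketch below. `one_sample_t` is a hypothetical helper written for illustration, and the data `y` and null value `mu0 = 4` are made up.

```r
# Two-sided one-sample t-test computed by hand.
# one_sample_t is a hypothetical helper, not an existing R function.
one_sample_t <- function(y, mu0) {
  n     <- length(y)
  se    <- sd(y) / sqrt(n)                 # SE of the mean
  tstat <- (mean(y) - mu0) / se            # test statistic
  pval  <- 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)
  list(t = tstat, df = n - 1, p.value = pval)
}

# Made-up data and null value, for illustration only
y   <- c(4.1, 5.3, 3.8, 6.0, 5.1, 4.7)
res <- one_sample_t(y, mu0 = 4)
res
```

The statistic and \( P \)-value should agree with `t.test(y, mu = 4)`.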
One-sample \( t \)-test
\[ H_{0}: \mu = 0 \\ H_{A}: \mu \neq 0 \]
First, let's calculate a \( P \)-value:
tstat <- (mu - 0)/sderr
(pval <- 2*pt(abs(tstat), df=n-1, lower.tail=FALSE))
[1] 6.056689e-08
Conclusion: With significance level \( \alpha = 0.05 \), since \( P < 0.05 \), we reject the null hypothesis.
Second, using critical values:
(tstat)
[1] 7.141309
(tcrit <- qt(0.025, df=n-1, lower.tail=FALSE))
[1] 2.042272
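The decision rule is to reject \( H_{0} \) when \( |t| \) exceeds the critical value. A self-contained sketch that rebuilds the test statistic from the summary statistics printed earlier (length 31, mean 39.32903, sd 30.66312); the variable names `ybar`, `s`, and `reject` are assumptions made here.

```r
# Summary statistics as reported for the elevation data
n    <- 31
ybar <- 39.32903
s    <- 30.66312

se    <- s / sqrt(n)                               # standard error of the mean
tstat <- (ybar - 0) / se                           # test statistic under H0: mu = 0
tcrit <- qt(0.025, df = n - 1, lower.tail = FALSE) # two-tailed, alpha = 0.05

# Reject H0 when the statistic is more extreme than the critical value
reject <- abs(tstat) > tcrit
c(tstat = tstat, tcrit = tcrit, reject = reject)
```

Since \( |t| \) is well beyond the critical value, this approach also rejects the null hypothesis at \( \alpha = 0.05 \).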
Third, using the t.test function:
t.test(elevation, mu=0, conf.level=0.95)
One Sample t-test
data: elevation
t = 7.1413, df = 30, p-value = 6.057e-08
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
28.08171 50.57635
sample estimates:
mean of x
39.32903
In many cases, it isn't the mean that we are interested in estimating but the variability of a population measure.
Remember, variance is also a population parameter, so we should be able to estimate it.
Stalk-eyed flies have staring contests! Flies with longer eye stalks usually win.
Definition: If \( Y \) has a normal distribution, then the sampling distribution of the quantity \[ \chi^{2} = (n-1)s^2/\sigma^2 \] is the \( \chi^2 \) distribution with \( n-1 \) degrees of freedom.
\[ \frac{df s^2}{\chi^2_{\alpha/2,df}} < \sigma^2 < \frac{df s^2}{\chi^2_{1-\alpha/2,df}} \]
myData <- read.csv("/Users/mdlama/Dropbox/Work/Teaching/College of William and Mary/Spring 2017/Datasets/chapter11/chap11e2Stalkies.csv")
eyespan <- myData$eyespan
summary(eyespan)
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.150 8.630 8.690 8.778 8.960 9.450
svar <- var(eyespan)
df <- length(eyespan) - 1
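# Lower- and upper-tail chi-square critical values (area 0.025 in each tail)
chi2L <- qchisq(0.025, df=df, lower.tail=TRUE)
chi2U <- qchisq(0.025, df=df, lower.tail=FALSE)
# Dividing by the larger critical value gives the lower bound
ci <- c(df*svar/chi2U, df*svar/chi2L)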
names(ci) <- c("lower bound", "upper bound")
ci
lower bound upper bound
0.07238029 0.58225336
Note: Same assumptions as confidence interval for mean, but much less robust to deviations from these assumptions!!!
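A simulation can illustrate how sensitive the variance interval is to non-normality. This is a sketch under assumptions chosen here: `ci_var` and `coverage` are hypothetical helpers, and the sample size, replicate count, and the use of an exponential population as the non-normal case are illustrative choices.

```r
set.seed(42)

# Chi-square confidence interval for the variance (hypothetical helper)
ci_var <- function(y, conf_level = 0.95) {
  df    <- length(y) - 1
  alpha <- 1 - conf_level
  chi2U <- qchisq(alpha / 2, df = df, lower.tail = FALSE)
  chi2L <- qchisq(alpha / 2, df = df, lower.tail = TRUE)
  c(lower = df * var(y) / chi2U, upper = df * var(y) / chi2L)
}

# Fraction of simulated intervals that contain the true variance
coverage <- function(rgen, true_var, n = 20, n_reps = 5000) {
  hits <- replicate(n_reps, {
    ci <- ci_var(rgen(n))
    ci[["lower"]] < true_var && true_var < ci[["upper"]]
  })
  mean(hits)
}

cov_normal <- coverage(function(n) rnorm(n), true_var = 1)  # normal, variance 1
cov_exp    <- coverage(function(n) rexp(n),  true_var = 1)  # skewed, variance 1

c(normal = cov_normal, exponential = cov_exp)
```

For the normal population the coverage should sit near the nominal 95%, while for the skewed population it falls well short.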