26-03-2025                       >eR-BioStat


Foundations for inference using R

Ziv Shkedy and Thi Huyen Nguyen, based on Chapter 4 of the book by Julie Vu and Dave Harrington, Introductory Statistics for the Life and Biomedical Sciences (https://www.openintro.org/book/biostat/)

## load/install libraries
.libPaths(c("./Rpackages",.libPaths()))
library(knitr)
library(tidyverse)
library(deSolve)
library(minpack.lm)
library(ggpubr)
library(readxl)
library(gamlss)
library(data.table)
library(grid)
library(png)
library(nlme)
library(gridExtra)
library(mvtnorm)
library(e1071)
library(lattice)
library(ggplot2)
library(dslabs)
library(NHANES)
library(plyr)
library(dplyr)
library(nasaweather)
library(gganimate)
library(av)
library(gifski)
library(foreach)
library("DAAG")
library(DT)
library(TeachingDemos)

1. Variability in estimates

1.2 A point estimate for the population parameter

A natural way to estimate features of the population, such as the population mean, is to use the corresponding summary statistic calculated from the sample. For example, the sample mean \(\bar{x}\) is a point estimate for the unknown population mean \(\mu\) and the sample variance \(s^{2}\) is a point estimate for the population variance \(\sigma^{2}\).

1.3 Example: the wind speed in the airquality dataset

Data and point estimates

The airquality dataset gives information about 153 daily air quality measurements in New York, May to September 1973.

dim(airquality)
## [1] 153   6

The first 6 lines of the dataset are shown below.

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

The variable Wind is the average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport. The mean wind speed is \(\hat{\mu}=\bar{x}=9.95\) and the sample standard deviation is \(\hat{\sigma}=s=3.52\). The sample mean is a point estimate of the population mean. If a different random sample of 153 days were taken, the new sample mean would likely be different as a result of sampling variation. Note that while estimates generally vary from one sample to another, the population mean (\(\mu\)) and variance (\(\sigma^{2}\)) are fixed values (the parameters of the population).

wind<-airquality$Wind
mean(wind)
## [1] 9.957516
sd(wind)
## [1] 3.523001

Visualization

Figures 1 and 2 show a histogram and a boxplot of the data. Both figures indicate a distribution that is slightly skewed to the right.

ggplot(airquality, aes(x = Wind)) +
  geom_histogram(aes(y = ..density..), fill = "skyblue", color = "black") +
  geom_density(alpha = 0.2, fill = "orange")+
  ylab("Density")

Figure 1: Histogram with density plot of wind speed.

ggplot(airquality, aes(x = "", y = Wind)) +
  geom_boxplot(fill = "skyblue", color = "black")+
  geom_jitter(aes(x = "", y = Wind), color = "black", size = 1, alpha = 0.5)+
  xlab("")

Figure 2: Box plot of wind speed.


1.4 The variability of the sample mean

Point estimates become more accurate with increasing sample size. Figure 3 shows the sample mean calculated for random samples drawn from \(N(0,1)\), where the sample size increases by 1 for each draw until it equals 1000. The red horizontal line in the figure is drawn at zero, the population mean. Note how a sample size around 50 may produce a sample mean that is relatively far above or below the population mean. As the sample size increases, the fluctuations around the population mean decrease; in other words, as the sample size increases, the sample mean becomes less variable and provides a more reliable estimate of the population mean.

ni<-seq(from=1,to=1000,by=1)           # sample sizes 1,2,...,1000
m.x<-c(1:length(ni))
for(i in 1:length(ni))
{
xsample<-rnorm(ni[i],0,1)              # a random sample of size ni[i] from N(0,1)
m.x[i]<-mean(xsample)                  # the sample mean
}
plot(ni,m.x,type="l",xlab="sample size",ylab="sample mean")
abline(0,0,col=2)                      # the population mean (0) in red

Figure 3: Sample means of random samples from \(N(0,1)\).

1.5 The sampling distribution for the mean

The sample mean of the wind speed is 9.95 mph. Another random sample of 153 days might produce a different value of \(\bar{x}\). Repeated random sampling could result in additional different values. Each sample mean \(\bar{x}\) can be thought of as a single observation from a random variable \(\bar{X}\). The distribution of \(\bar{X}\) is called the sampling distribution of the sample mean, and has its own mean and standard deviation. The concept of a sampling distribution can be illustrated by taking repeated random samples from \(N(0,1)\). Figure 4 shows a histogram of sample means from 1,000 random samples of size 50 from \(N(0,1)\). The histogram provides an approximation of the theoretical sampling distribution of \(\bar{X}\) for samples of size 50.

ni<-50                                 # sample size
m.x<-c(1:1000)
for(i in 1:1000)
{
xsample<-rnorm(ni,0,1)                 # a random sample of size 50 from N(0,1)
m.x[i]<-mean(xsample)                  # the sample mean
}
hist(m.x,nclass=50,main=" ",xlab="sample mean")

Figure 4: A histogram of 1000 sample means from \(N(0,1)\), n = 50.

1.6 The Central Limit Theorem

Since the complete sampling distribution consists of means for all possible samples of size 50, drawing a much larger number of samples provides a more accurate view of the distribution; the left panel of Figure 5 shows the distribution calculated from 100,000 sample means. A normal probability plot of these sample means is shown in the right panel of Figure 5. The points fall closely along a straight line, implying that the distribution of sample means is nearly normal. This result follows from the Central Limit Theorem.

ni<-50
m.x<-c(1:100000)
for(i in 1:100000)
{
xsample<-rnorm(ni,0,1)
m.x[i]<-mean(xsample)
}
par(mfrow=c(1,2))
hist(m.x,nclass=50,main=" ",xlab="sample mean")
qqnorm(m.x)

Figure 5: The central limit theorem. A histogram of 100000 sample means from \(N(0,1)\) (left panel) and a normal probability plot (right panel), n = 50.

The sampling distribution for the mean is unimodal and symmetric around the mean of the random variable \(\bar{X}\). Statistical theory can be used to show that the mean of the sampling distribution for \(\bar{X}\) is exactly equal to the population mean \(\mu\). However, in almost any study, conclusions about a population parameter must be drawn from the data collected from a single sample. The sampling distribution of \(\bar{X}\) is a theoretical concept, since obtaining repeated samples by conducting a study many times is not possible. In other words, it is not feasible to calculate the population mean \(\mu\) by finding the mean of the sampling distribution for \(\bar{X}\).
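Although the sampling distribution is theoretical, the claim that its mean equals \(\mu\) can be checked by simulation (a small sketch of our own, reusing the setup above):

set.seed(1)
m.x<-c(1:10000)
for(i in 1:10000)
{
m.x[i]<-mean(rnorm(50,0,1))     # the mean of a sample of size 50 from N(0,1)
}
mean(m.x)                       # close to the population mean 0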

2. Standard error of the mean

2.1 Estimation of the variability of \(\bar{X}\)

The standard error (SE) of the sample mean measures the sample-to-sample variability of \(\bar{X}\), the extent to which values of the repeated sample means oscillate around the population mean. The theoretical standard error of the sample mean is calculated by dividing the population standard deviation (\(\sigma_{X}\)) by the square root of the sample size \(n\). Since the population standard deviation is typically unknown, the sample standard deviation \(s\) is often used in the definition of the standard error; \(s\) is a reasonably good estimate of \(\sigma_{X}\). If \(\bar{X}\) represents the sample mean, its standard error (denoted by SE) is given by

\[SE_{\bar{X}}=\frac{s}{\sqrt{n}}.\]

This estimate tends to be sufficiently good when the sample size is at least 30 and the population distribution is not strongly skewed. In the case of skewed distributions, a larger sample size is necessary.
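As a small numerical illustration (our own sketch, not part of the original text), the estimated standard error \(s/\sqrt{n}\) from a single sample of size 50 from \(N(0,1)\) can be compared with the theoretical value \(\sigma/\sqrt{n}=1/\sqrt{50}\):

set.seed(1)
xsample<-rnorm(50,0,1)
sd(xsample)/sqrt(50)            # estimated standard error based on s
1/sqrt(50)                      # theoretical standard error (sigma = 1 known)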

2.2 Example: \(SE_{\bar{X}}\) for samples from \(N(0,1)\)

Larger sample sizes produce sampling distributions that have lower variability. Increasing the sample size causes the distribution of \(\bar{X}\) to be clustered more tightly around the population mean \(\mu\), allowing for more accurate estimates of \(\mu\) from a single sample. This is shown in Figure 6, which presents histograms of 100,000 sample means for samples of size 50 (upper panel) and size 100 (lower panel) taken from \(N(0,1)\). Note that when the sample size is large, it is more likely that any particular sample mean will be close to the population mean.

ni1<-50
ni2<-100
m.x1<-m.x2<-c(1:100000)
for(i in 1:100000)
{
xsample1<-rnorm(ni1,0,1)
xsample2<-rnorm(ni2,0,1)
m.x1[i]<-mean(xsample1)
m.x2[i]<-mean(xsample2)
}
par(mfrow=c(2,1))
hist(m.x1,nclass=50,main="n=50",xlab="sample mean",xlim=c(-0.7,0.7))
hist(m.x2,nclass=50,main="n=100",xlab="sample mean",xlim=c(-0.7,0.7))

Figure 6: Histograms of 100000 sample means from \(N(0,1)\), n = 50,100.
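The effect of the sample size can also be checked numerically (a sketch using the simulated means m.x1 and m.x2 from the chunk above); their standard deviations should be close to the theoretical standard errors \(1/\sqrt{50}\) and \(1/\sqrt{100}\):

sd(m.x1)                        # approximately 1/sqrt(50) = 0.141
sd(m.x2)                        # approximately 1/sqrt(100) = 0.1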

3. Confidence intervals

3.1 Interval estimates for a population parameter

While a point estimate consists of a single value, an interval estimate provides a plausible range of values for a parameter. When estimating a population mean \(\mu\), a confidence interval for \(\mu\) has the general form \[(\bar{x} -m; \bar{x} +m) = \bar{x} \pm m , \] where \(m\) is the margin of error. Intervals that have this form are called two-sided confidence intervals because they provide both lower and upper bounds, \(\bar{x} - m\) and \(\bar{x} +m\) , respectively.
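The general form translates directly into a short helper function (a sketch of our own, anticipating the normal-based margin of error of the next subsection; the name ci.mean and the default multiplier z = 1.96 for a 95% level are illustrative):

# two-sided confidence interval: point estimate +/- margin of error
ci.mean<-function(x,z=1.96)
{
m<-z*sd(x)/sqrt(length(x))      # margin of error m = z * SE
c(mean(x)-m,mean(x)+m)          # (lower limit, upper limit)
}
ci.mean(airquality$Wind)        # 95% C.I. for the mean wind speed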

3.2 Confidence interval based on a Normal distribution

The standard error of the sample mean is the standard deviation of its distribution; additionally, the distribution of sample means is nearly normal and centered at \(\mu\). Under the normal model, the sample mean \(\bar{x}\) will be within 1.96 standard errors (i.e., standard deviations) of the population mean \(\mu\) approximately 95% of the time. Thus, if an interval is constructed that spans 1.96 standard errors from the point estimate in either direction, a data analyst can be 95% confident that the interval \[\bar{x} \pm 1.96 \times SE\] contains the population mean. As shown in Figure 7, \(-1.96\) and \(1.96\) are the \(2.5\%\) and \(97.5\%\) quantiles of the standard normal distribution \(N(0,1)\).

x<-seq(from=-3,to=3,length=1000)
dx<-dnorm(x,0,1)
qx<-qnorm(0.025,0,1)
qx
## [1] -1.959964
plot(x,dx,type="l",ylab="f(x)")
lines(c(qx,qx),c(0,0.35))
lines(-c(qx,qx),c(0,0.35))
text(-1.96,0.375,"-1.96")
text(1.96,0.375,"1.96")
text(-2.5,0.005,"0.025")
text(2.5,0.005,"0.025")
text(0,0.005,"0.95")

Figure 7: The \(2.5\) (\(Z_{0.025}\)) and \(97.5\) (\(Z_{0.975}\)) quantiles of \(N(0,1)\).

3.3 Coverage rate

The value \(95\%\) is an approximation, accurate when the sampling distribution for the sample mean is close to a normal distribution. This assumption holds when the sample size is sufficiently large. The phrase “\(95\%\) confident” has a subtle interpretation: if many samples were drawn from a population, and a confidence interval \(\bar{x} \pm 1.96 \times SE\) is calculated from each one, about \(95\%\) of those intervals would contain the population mean \(\mu\). Figure 8 illustrates this process with 100 confidence intervals obtained for 100 samples of size \(n=25\) taken from \(N(0,1)\).

ni1<-25                                 # sample size
B<-100                                  # number of samples
m.x1<-sd.x<-UL<-LL<-SE<-c(1:B)
for(i in 1:B)
{
xsample<-rnorm(ni1,0,1)
m.x1[i]<-mean(xsample)
sd.x[i]<-sqrt(var(xsample))
SE[i]<-sd.x[i]/sqrt(ni1)                # standard error of the sample mean
LL[i]<-m.x1[i]-1.96*SE[i]               # lower limit of the 95% C.I.
UL[i]<-m.x1[i]+1.96*SE[i]               # upper limit of the 95% C.I.
}
plot(c(1:B),m.x1,ylim=c(-1,1),xlab="Sample",ylab="95% C.I")
abline(0,0,col=2)                       # the population mean (0) in red
for(i in 1:B)
{
lines(c(i,i),c(LL[i],UL[i]))
}

Figure 8: Confidence intervals for 100 samples (n=25) from \(N(0,1)\).
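The empirical coverage rate of the intervals in Figure 8 can be calculated directly (a sketch using the objects LL and UL from the chunk above); with a true mean of 0, roughly 95 of the 100 intervals are expected to contain it:

mean(LL<=0 & UL>=0)             # proportion of intervals that contain mu = 0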

3.4 Example: a \(95\%\) confidence interval for the wind speed

Calculation of the lower and upper limits

Figure 9 shows the histogram for the average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport in 1973. The mean wind speed is \(\hat{\mu}=\bar{x}=9.95\) and the sample standard deviation is \(\hat{\sigma}=s=3.52\).

wind<-airquality$Wind
M.wind<-mean(wind)
SD.wind<-sqrt(var(wind))
M.wind
## [1] 9.957516
SD.wind
## [1] 3.523001
ggplot(airquality, aes(x = Wind)) +
  geom_histogram(fill = "skyblue", color = "black")+
  ylab("Frequency")

Figure 9: Histogram of the wind speed.

The standard error of the sample mean is equal to

\[\frac{3.52}{\sqrt{153}}=0.2848.\]

n<-length(wind)
SE<-SD.wind/sqrt(n)
SE
## [1] 0.2848178

A \(95\%\) confidence interval for the mean wind speed is given by

\[\bar{x}_{wind} \pm 1.96 \times SE = 9.95 \pm 1.96 \times 0.2848.\]

LL<-M.wind-1.96*SE
UL<-M.wind+1.96*SE
c(LL,UL)
## [1]  9.399273 10.515759

Visualization

Figure 10 shows the histogram of the daily average wind speed together with the lower and upper limits of the \(95\%\) confidence interval (red lines) and the sample mean (green line). Figure 11 shows the data with the \(95\%\) confidence interval.

wind<-airquality$Wind
M.wind<-mean(wind)
SD.wind<-sqrt(var(wind))
M.wind
## [1] 9.957516
SD.wind
## [1] 3.523001
hist(wind,col=0)
lines(c(LL,LL),c(0,50),col=2,lwd=2)
lines(c(UL,UL),c(0,50),col=2,lwd=2)
lines(c(M.wind,M.wind),c(0,50),col=3,lwd=2)

Figure 10: A histogram for the wind speed and confidence interval.

wind<-airquality$Wind
ni<-length(wind)
M.wind<-mean(wind)
plot(c(1:ni),wind)
lines(c(1,ni),c(UL,UL),col=2,lwd=2)
lines(c(1,ni),c(LL,LL),col=2,lwd=2)
lines(c(1,ni),c(M.wind,M.wind),col=3,lwd=2)

Figure 11: The daily average wind speed and a confidence interval for the population mean.


A \(95\%\) confidence interval using the R function z.test

The R function z.test requires knowledge of the population standard deviation. Let us assume that the population standard deviation, \(\sigma\), is known and equal to 3.52 (\(\sigma=3.52\)).

A general call of the function z.test() has the form

  z.test(variable, sd = standard deviation)

For the variable wind we have

library(TeachingDemos)
wind=na.omit(airquality$Wind)
sigma=3.52
z.test(wind,sd=sigma)
## 
##  One Sample z-test
## 
## data:  wind
## z = 34.991, n = 153.00000, Std. Dev. = 3.52000, Std. Dev. of the sample
## mean = 0.28458, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   9.399759 10.515273
## sample estimates:
## mean of wind 
##     9.957516

3.5 Changing the confidence level

Ninety-five percent confidence intervals are the most commonly used interval estimates, but intervals with confidence levels other than \(95\%\) can also be constructed. The general formula for a confidence interval (for the population mean \(\mu\)) with confidence level \(1-2 \times \alpha\) is given by \[\bar{x} \pm z_{\alpha} \times SE, \] where \(z_{\alpha}\) is chosen according to the confidence level. For a \(95\%\) confidence level, \(z_{\alpha}=1.96\), since, as shown in Figure 12, the area within 1.96 standard deviations of the mean captures 95% of the distribution.
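The value of \(z_{\alpha}\) for any confidence level can be obtained with qnorm() (a short sketch; the three levels below are examples):

alpha<-c(0.05,0.025,0.005)      # 90%, 95% and 99% two-sided confidence levels
qnorm(1-alpha)                  # 1.645, 1.960, 2.576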

x<-seq(from=-3,to=3,length=1000)
dx<-dnorm(x,0,1)
qx<-qnorm(0.025,0,1)
plot(x,dx,type="l",ylab="f(x)")
lines(c(qx,qx),c(0,0.35))
lines(-c(qx,qx),c(0,0.35))
text(-1.96,0.375,"-1.96")
text(1.96,0.375,"1.96")
text(0,0.005,"0.95")

Figure 12: Quantiles for \(N(0,1)\).

To construct a \(99\%\) confidence interval, \(z_{\alpha}\) must be chosen such that 99% of the normal curve is captured between \(-z_{\alpha}\) and \(z_{\alpha}\). In this case, \(\alpha=0.005\), so \(1-2 \times \alpha=0.99\) and \(z_{0.005}=-2.575\) (the red lines in Figure 13).

qx1<-qnorm(0.005,0,1)
qx1
## [1] -2.575829
plot(x,dx,type="l",ylab="f(x)")
lines(c(qx,qx),c(0,0.35),lty=2)
lines(c(qx1,qx1),c(0,0.35),col=2)
lines(-c(qx,qx),c(0,0.35),lty=2)
lines(-c(qx1,qx1),c(0,0.35),col=2)
text(-1.96,0.375,"-1.96")
text(1.96,0.375,"1.96")
text(-2.575,0.375,"-2.575")
text(2.575,0.375,"2.575")
text(-2.85,0.005,"0.005",cex=0.75,col=2)
text(2.85,0.005,"0.005",cex=0.75,col=2)
text(0,0.005,"0.99",cex=0.75,col=2)

Figure 13: The 0.005 and 0.995 quantiles of \(N(0,1)\) (the red lines).
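As an illustration (our own sketch, not part of the original example), a \(99\%\) confidence interval for the mean wind speed uses 2.575 instead of 1.96 and is therefore wider than the \(95\%\) interval calculated in Section 3.4:

wind<-airquality$Wind
SE<-sd(wind)/sqrt(length(wind))
mean(wind)+c(-1,1)*2.575*SE     # 99% C.I., wider than the 95% C.I.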

3.6 Example: The NHANES dataset

The NHANES dataset consists of data from the US National Health and Nutrition Examination Study. Information about 76 variables is available for the 10000 individuals included in the study. These 10000 individuals are considered as the population. In this section we focus on the BMI (the variable BMI).

library(NHANES)
data(NHANES)
dim(NHANES)
## [1] 10000    76

The population parameters

After removing the individuals with missing values for BMI, 9634 individuals are considered as the population. For these individuals, \(\mu=26.66\) and \(\sigma^{2}=54.41\). These are the parameters of the population.

library(NHANES)
data(NHANES)
bmi<-na.omit(NHANES$BMI)
length(bmi)
## [1] 9634
mean(bmi)
## [1] 26.66014
var(bmi)
## [1] 54.41392

Visualization

Figures 14 and 15 show a histogram and a violin plot of the BMI in the population. Both figures reveal a long right tail in the distribution.

ggplot(NHANES, aes(x = BMI)) +
  geom_histogram(aes(y = ..density..), fill = "skyblue", color = "black") +
  geom_density(alpha = 0.2, fill = "orange")+
  ylab("Density")

Figure 14: Histogram with density plot for the BMI.

ggplot(NHANES, aes(x = "", y = BMI)) +
  geom_violin(fill = "lightblue")+
  geom_jitter(aes(x = "", y = BMI), color = "black", size = 0.1, alpha = 0.5)+
  xlab("")

Figure 15: Violin plot for the BMI.

A random sample of size 50 from the data

From the population of 9634 individuals we draw a random sample of 50 individuals. The sample mean and variance are equal to \(\bar{x}=26.768\) and \(s^{2}=87.67349\), respectively.

set.seed(123456)
x.bmi<-sample(na.omit(NHANES$BMI),size=50,replace=FALSE)
length(x.bmi)
## [1] 50
mean(x.bmi)
## [1] 26.768
var(x.bmi)
## [1] 87.67349

Visualization

A histogram and a violin plot for the BMI sample are shown in Figure 16.

hist_bmi = ggplot(data.frame(BMI = x.bmi), aes(x = BMI)) +
  geom_histogram(fill = "skyblue", color = "black")+
  ylab("Frequency")+
  ggtitle("Histogram")

violin_bmi = ggplot(data.frame(BMI = x.bmi), aes(x = "", y = BMI)) +
  geom_violin(fill = "lightblue")+
  xlab("")+
  ggtitle("Violin plot")
grid.arrange(hist_bmi, violin_bmi, ncol = 2)

Figure 16: Histogram and violin plot of a random sample of the BMI (n=50).

A \(95\%\) C.I for the mean BMI

The standard error of the sample mean is equal to

\[\frac{s}{\sqrt{n}}=\sqrt{\frac{87.67349}{50}}=1.3241.\]

n<-length(x.bmi)
SD.x<-sqrt(var(x.bmi))
SE<-SD.x/sqrt(n)
SE
## [1] 1.324186

For a \(95\%\) confidence level, \(\alpha=0.025\) and \(Z_{\alpha}=1.96\). The confidence interval is given by

\[\bar{x} \pm m = \bar{x} \pm 1.96 \times SE.\]

For the BMI sample we get

LL<-mean(x.bmi)-1.96*SE
UL<-mean(x.bmi)+1.96*SE
c(LL,UL)
## [1] 24.17259 29.36341

This implies that, with \(95\%\) confidence, we can conclude that the true population mean \(\mu\) is between 24.17 and 29.36. Note that the true mean \(\mu=26.66\) is indeed inside the interval.

Visualization

The histogram of the sample with a \(95\%\) confidence interval is shown in Figure 17.

M.bmi<-mean(x.bmi)
hist(x.bmi,col=0,xlab="BMI",main="BMI")
lines(c(LL,LL),c(0,50),col=2,lwd=2)
lines(c(UL,UL),c(0,50),col=2,lwd=2)
lines(c(M.bmi,M.bmi),c(0,50),col=3,lwd=2)

Figure 17: Histogram for the BMI with a 95% confidence interval for the population mean.

4. Hypothesis testing

4.1 The Formal Approach to Hypothesis Testing

Formulating null and alternative hypotheses

In this section we focus on the wind speed in the airquality data. The claim to be tested is that the population mean of the average wind speed is equal to 9 mph, \[ H_{0} : \mu = 9.\] This implies that under the null hypothesis, the population mean \(\mu=\mu_{0}=9\). In the absence of prior evidence that the wind speed is slower or faster, it is reasonable to begin with an alternative hypothesis that allows the population mean to be in either direction, \[H_{A} : \mu \ne 9.\]

Specifying a significance level, \(\alpha\)

It is important to specify how rare or unlikely an event must be in order to represent sufficient evidence against the null hypothesis. This should be done during the design phase of a study, to prevent any bias that could result from defining ‘rare’ only after analyzing the results. When testing a statistical hypothesis, an investigator specifies a significance level, \(\alpha\), that defines a ‘rare’ event. Typically, \(\alpha\) is chosen to be 0.05, though it may be larger or smaller, depending on context. An \(\alpha\) level of 0.05 implies that an event occurring with probability lower than \(5\%\) will be considered sufficient evidence against \(H_{0}\).

Calculating the test statistic

Calculating the test statistic \(t\) is analogous to standardizing observations with Z-scores. The test statistic quantifies the number of standard errors between the sample mean \(\bar{x}\) and the null value \(\mu_{0}\). If the population standard deviation is known, the test statistic is given by

\[t=\frac{\bar{x}-\mu_{0}}{\sigma/\sqrt{n}}. \]

If \(\sigma\) is unknown, which is typically the case, we use the estimate from the sample, \(s\),

\[t=\frac{\bar{x}-\mu_{0}}{s/ \sqrt{n}}. \]

Calculating the \(p\)-value

The \(p\)-value is the probability of observing a sample mean as or more extreme than the observed value, under the assumption that the null hypothesis is true. In samples of size 40 or more, the t-statistic has an approximately standard normal distribution \(N(0,1)\), unless the data are strongly skewed or extreme outliers are present. A small R sketch of these calculations follows the list below.

  • For two-sided tests, with \(H_{A} : \mu \ne \mu_{0}\), the p-value is the sum of the area of the two tails defined by the t-statistic: \[ 2 \times P(Z > |t|) . \]

  • For one-sided tests with \(H_{A}: \mu > \mu_{0}\) , the \(p\)-value is given by \(P (Z > t)\). For \(H_{A}:\mu < \mu_{0}\) , the \(p\)-value is the area to the left of the t-statistic, \(P(Z \le t)\).
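A minimal sketch of these calculations in R (the simulated sample x and the null value mu0 are illustrative only):

set.seed(1)
x<-rnorm(100,0.2,1)                        # illustrative sample
mu0<-0                                     # null value
t<-(mean(x)-mu0)/(sd(x)/sqrt(length(x)))   # test statistic
2*(1-pnorm(abs(t)))                        # two-sided p-value
1-pnorm(t)                                 # one-sided p-value for H_A: mu > mu0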

Drawing a conclusion

For a significance level \(\alpha\),

  • If \(p\mbox{-value} > \alpha\), the observed sample mean is not extreme enough to warrant rejecting \(H_{0}\); more formally stated, there is insufficient evidence to reject \(H_{0}\). A high \(p\)-value suggests that the difference between the observed sample mean and \(\mu_{0}\) can reasonably be attributed to random chance.
  • If \(p\mbox{-value} \le \alpha\), there is sufficient evidence to reject \(H_{0}\) in favor of \(H_{A}\).

4.2 Example: wind speed in New York 1973

The sample and point estimators

Recall that for the airquality dataset \(n=153\), with sample mean and standard deviation equal to \(\bar{x}=9.957516\) and \(s=3.523001\), respectively. Figure 9 shows the distribution of the wind speed.

wind<-airquality$Wind
M.wind<-mean(wind)
SD.wind<-sqrt(var(wind))
M.wind
## [1] 9.957516
SD.wind
## [1] 3.523001

Hypothesis testing

We wish to test the null hypothesis against a two-sided alternative, that is

\[H_{0}:\mu=9\;\;\mbox{vs.}\;\; H_{1}:\mu \ne 9.\]

Test statistic

\[ t=\frac{\bar{x}-\mu_{0}}{s/\sqrt{n}}=\frac{9.957-9}{3.523 / \sqrt{153}}=3.3619\]

\(p\)-value

The \(p\)-value is equal to \((1-P(Z \le |3.3619|)) \times 2=2 \times 0.000387=0.000774<0.05\) and therefore the null hypothesis is rejected.

1-pnorm(3.3619,0,1)
## [1] 0.0003870408
2*(1-pnorm(3.3619,0,1))
## [1] 0.0007740815

Note that the \(p\)-value is equal to twice the area above the test statistic 3.3619 in Figure 18.

x<-seq(from=-4,to=4,length=1000)
dx<-dnorm(x,0,1)
qx<-qnorm(0.025,0,1)
plot(x,dx,type="l",xlim=c(-4.5,4.5))
lines(c(qx,qx),c(0,0.35))
lines(-c(qx,qx),c(0,0.35))
lines(c(3.3619,3.3619),c(0,0.35),col=2,lwd=2)   # the observed test statistic
text(3.3619,0.375,"t=3.3619",col=2)
text(1.96,0.375,"1.96")

Figure 18: Illustration of the p-value. Red line: the observed test statistic.

Using the R function z.test()

We assume that in the population \(\sigma=s=3.523\) and produce the following results using the function z.test.

wind<-airquality$Wind
M.wind<-mean(wind)
SD.wind<-sqrt(var(wind))
z.test(wind, sd = SD.wind, mu = 9)
## 
##  One Sample z-test
## 
## data:  wind
## z = 3.3619, n = 153.00000, Std. Dev. = 3.52300, Std. Dev. of the sample
## mean = 0.28482, p-value = 0.0007742
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
##   9.399284 10.515749
## sample estimates:
## mean of wind 
##     9.957516

4.3 Example: the NHANES dataset - analysis of the number of sleep hours per night

The population

In this section, the variable of interest is the number of sleeping hours per night (the variable SleepHrsNight). Information about the number of sleeping hours per night is available for 7755 individuals (i.e., the population). The population mean and variance are \(\mu=6.927\) and \(\sigma^{2}=1.813\), respectively.

library(NHANES)
data(NHANES)
sleep<-na.omit(NHANES$SleepHrsNight)
length(sleep)
## [1] 7755
mean(sleep)
## [1] 6.927531
var(sleep)
## [1] 1.81368

Visualization

Figures 19, 20 and 21 show the histogram, dotplot and boxplot of the number of sleeping hours in the population and reveal a discrete distribution, as expected. Note that the number of sleeping hours per night is a count variable.

ggplot(NHANES, aes(x = SleepHrsNight)) +
  geom_histogram(fill = "skyblue", color = "black")+
  ylab("Frequency")+
  xlab("Sleep hours per night")

Figure 19: Histogram of sleep hours per night.

 ggplot(NHANES, aes(x = c(1:length(NHANES$SleepHrsNight)), y = SleepHrsNight)) +
   geom_point(position = position_jitter(height = 0.05), color = "blue") +
   xlab("Subject")+
   ylab("Sleep hours per night")

Figure 20: Dot plot of sleep hours per night.

ggplot(NHANES, aes(x = "", y = SleepHrsNight)) +
  geom_boxplot(fill = "skyblue", color = "black")+
  geom_jitter(aes(x = "", y = SleepHrsNight), color = "black", size = 0.1, alpha = 0.5)+
  ylab("Sleep hours per night")+
  xlab("")

Figure 21: Boxplot of sleep hours per night.

A random sample of size 150 from the population

We draw a sample of 150 individuals from the population (\(n=150\)). The point estimates from the sample are \(\bar{x}=6.8466\) and \(s^{2}=1.8622\).

set.seed(456789)
x.sleep<-sample(na.omit(NHANES$SleepHrsNight),size=150,replace=FALSE)
length(x.sleep)
## [1] 150
mean(x.sleep)
## [1] 6.846667
var(x.sleep)
## [1] 1.862237

Visualization

Figure 22 shows the histogram (left panel) and boxplot (right panel) of the sample.

hist_sleep = ggplot(data.frame(SleepHrsNight = x.sleep), aes(x = SleepHrsNight)) +
  geom_histogram(fill = "skyblue", color = "black")+
  ylab("Frequency")+
  xlab("Sleep hours per night")+
  ggtitle("Histogram")
box_sleep = ggplot(data.frame(SleepHrsNight = x.sleep), aes(x = "", y = SleepHrsNight)) +
  geom_boxplot(fill = "lightblue")+
  xlab("")+
  ylab("Sleep hours per night")+
  ggtitle("Box plot")
grid.arrange(hist_sleep, box_sleep, ncol = 2)

Figure 22: Histogram and box plot of sleep hours per night in the sample.

A \(95\%\) C.I for the mean sleep hours per night

The sample standard deviation and the standard error of the sample mean are equal to 1.3646 and 0.1114, respectively.

n<-length(x.sleep)
SD.x<-sqrt(var(x.sleep))
SD.x
## [1] 1.364638
SE<-SD.x/sqrt(n)
SE
## [1] 0.1114222

For the sample, the margin of error for a \(95\%\) confidence interval is \(m=1.96 \times SE=1.96 \times 0.1114=0.2184\) and the confidence interval is given by \[\bar{x} \pm m=6.8466 \pm 0.2184 = (6.628279, 7.065054). \]

LL<-mean(x.sleep)-1.96*SE
UL<-mean(x.sleep)+1.96*SE
c(LL,UL)
## [1] 6.628279 7.065054

Visualization

Figure 23 shows a histogram of the number of sleep hours per night with a \(95\%\) confidence interval for the population mean.

hist_sleep = ggplot(data.frame(SleepHrsNight = x.sleep), aes(x = SleepHrsNight)) +
  geom_histogram(fill = "skyblue", color = "black")+
  geom_vline(xintercept = LL, color = "red", linetype = "dashed")+
  geom_vline(xintercept = UL, color = "red", linetype = "dashed")+
  geom_text(aes(x = LL-0.25, y = 42.5, label = "LL"), color = "red")+
  geom_text(aes(x = UL+0.25, y = 42.5, label = "UL"), color = "red")+
  ylab("Frequency")+
  xlab("Sleep hours per night")+
  ggtitle("Histogram and C.I")
hist_sleep

Figure 23: Histogram with a confidence interval for the population mean.

Hypothesis testing

We wish to test the null hypothesis \(H_{0}:\mu=7\) against a one-sided alternative \(H_{1}:\mu <7\). This can be done using the argument alternative = 'less' in the function z.test. Note that we assume that in the population, \(\sigma=1.3646\). As can be seen in the panel below, the sample mean number of sleeping hours is equal to \(\bar{x}=6.8466\) and the test statistic is equal to \(-1.3761\). The \(p\)-value \(=0.08439 > 0.05\), so there is insufficient evidence to reject the null hypothesis that \(\mu=7\).

mean(x.sleep)
## [1] 6.846667
sqrt(var(x.sleep))
## [1] 1.364638
z.test(x.sleep, mu = 7, sd = 1.364638, alternative = 'less')
## 
##  One Sample z-test
## 
## data:  x.sleep
## z = -1.3761, n = 150.00000, Std. Dev. = 1.36464, Std. Dev. of the
## sample mean = 0.11142, p-value = 0.08439
## alternative hypothesis: true mean is less than 7
## 95 percent confidence interval:
##     -Inf 7.02994
## sample estimates:
## mean of x.sleep 
##        6.846667

5 Hypothesis testing and confidence intervals

5.1 Two sided alternatives and confidence intervals

The relationship between a hypothesis test and the corresponding confidence interval is defined by the significance level \(\alpha\); the two approaches are based on the same inferential logic, and differ only in perspective. The hypothesis testing approach asks whether \(\bar{x}\) is far enough away from \(\mu_{0}\) to be considered extreme, while the confidence interval approach asks whether \(\mu_{0}\) is close enough to \(\bar{x}\) to be plausible. In both cases, “far enough” and “close enough” are defined by \(\alpha\), which determines the value of \(z_{\alpha}\) that is used to calculate the margin of error \(m = z_{\alpha} \times SE\).

  • Hypothesis Test. For a two-sided test, \(\bar{x}\) needs to be at least \(m\) units away from \(\mu_{0}\) in either direction to be considered extreme. The points marking off the rejection region are equal to the \(\pm z_{\alpha}\) values used in the confidence interval, with the positive and negative points accounting for the \(\pm\) structure in the confidence interval.
  • Confidence Interval. The plausible range of values for \(\mu_{0}\) around \(\bar{x}\) is defined as \((\bar{x} - m; \bar{x} +m)\). If \(\mu_{0}\) is plausible, it can be at most \(m\) units away from \(\bar{x}\) in either direction. If the interval does not contain \(\mu_{0}\), then \(\mu_{0}\) is implausible according to the data and there is sufficient evidence to reject \(H_{0}\); this duality is checked numerically in the sketch below.
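The duality can be verified with the wind speed example from Section 4.2 (a small sketch; the \(95\%\) interval is recomputed here):

wind<-airquality$Wind
SE<-sd(wind)/sqrt(length(wind))
mean(wind)+c(-1,1)*1.96*SE      # the 95% C.I.: (9.399, 10.516)
# mu0 = 9 falls outside the interval, consistent with rejecting H0 at alpha = 0.05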

5.2 Example: the NHANES dataset - analysis of the total cholesterol level

Population

In this section we focus on the total cholesterol level (the variable TotChol). After omitting all individuals with missing values, 8474 individuals are included in the analysis. The population parameters are equal to \(\mu=4.879\) and \(\sigma=1.075\).

library(NHANES)
data(NHANES)
dim(NHANES)
## [1] 10000    76
totchol<-na.omit(NHANES$TotChol)
length(totchol)
## [1] 8474
mean(totchol)
## [1] 4.87922
sqrt(var(totchol))
## [1] 1.075583

Visualization

Figures 24 and 25 present the data and indicate that the population distribution has a long right tail.

ggplot(NHANES, aes(x = TotChol)) +
  geom_histogram(aes(y = ..density..), fill = "skyblue", color = "black") +
  geom_density(alpha = 0.2, fill = "orange")+
  ylab("Density")+
  xlab("The total cholesterol level")

Figure 24: Histogram with density plot of the total cholesterol level.

box_cho<-ggplot(NHANES, aes(x = "", y = TotChol)) +
  geom_boxplot(fill = "skyblue", color = "black")+
  ylab("The total cholesterol level")+
  geom_jitter(aes(x = "", y = TotChol), color = "black", size = 0.1, alpha = 0.5)+
  xlab("")

violin_cho<-ggplot(NHANES, aes(x = "", y = TotChol)) +
  geom_violin(fill = "lightblue")+
  geom_jitter(aes(x = "", y = TotChol), color = "black", size = 0.1, alpha = 0.5)+
  xlab("")+
  ylab("The total cholesterol level")

grid.arrange(box_cho,violin_cho, ncol = 2)

Figure 25: Boxplot and violin plot of the total cholesterol level.

Sample of size 100 from the population

For a sample of 100 individuals, shown in Figure 26, \(\bar{x}=5.0041\) and \(s=1.20723\). This implies that the standard error of the sample mean is equal to \[\frac{1.20723}{\sqrt{100}}=0.120723.\]

set.seed(101112)
x.totchol<-sample(na.omit(NHANES$TotChol),size=100,replace=FALSE)
length(x.totchol)
## [1] 100
mean(x.totchol)
## [1] 5.0041
sqrt(var(x.totchol))
## [1] 1.20723
ggplot(data.frame(TotChol = x.totchol), aes(x = TotChol)) +
  geom_histogram(fill = "skyblue", color = "black")+
  ylab("Frequency")+
  xlab("The total cholesterol level")

Figure 26: Histogram of the total cholesterol level in the sample.

Visualization

The distribution of the total cholesterol in the sample is shown in Figure 27.

box_totchol = ggplot(data.frame(TotChol = x.totchol), aes(x = "", y = TotChol)) +
  geom_boxplot(fill = "lightblue")+
  geom_jitter(aes(x = "", y = TotChol), color = "black", size = 0.5, alpha = 0.5)+
  xlab("")+
  ylab("The total cholesterol level")+
  ggtitle("Box plot")

violin_totchol = ggplot(data.frame(TotChol = x.totchol), aes(x = "", y = TotChol)) +
  geom_violin(fill = "lightblue")+
  geom_jitter(aes(x = "", y = TotChol), color = "black", size = 0.5, alpha = 0.5)+
  xlab("")+
  ylab("The total cholesterol level")+
  ggtitle("Violin plot")

grid.arrange(box_totchol, violin_totchol, ncol = 2)

Figure 27: Box plot and violin plot of the total cholesterol level in the sample.

A \(95\%\) confidence interval for the mean cholesterol level in the population

In the sample, \(\bar{x}=5.0041\). A \(95\%\) confidence interval (\(\alpha=0.025\), \((1-2\alpha) \times 100 =95\)) for the population mean is given by

\[5.0041 \pm 1.96 \times 0.120723=(4.767487,5.240713).\]

Testing the null hypothesis \(H_{0}:\mu=5.1\) versus a two-sided alternative

We wish to test the hypothesis \(H_{0}:\mu=5.1\) versus \(H_{1}:\mu \ne 5.1\), using the sample standard deviation \(s=1.20723\) as the value of \(\sigma\). With a test statistic equal to \(-0.79438\) and a \(p\)-value equal to 0.427, we do not reject the null hypothesis. As shown above, the value 5.1 is covered by the \(95\%\) confidence interval (4.767487, 5.240713).

z.test(x.totchol, mu = 5.1, sd = 1.20723)
## 
##  One Sample z-test
## 
## data:  x.totchol
## z = -0.79438, n = 100.00000, Std. Dev. = 1.20723, Std. Dev. of the
## sample mean = 0.12072, p-value = 0.427
## alternative hypothesis: true mean is not equal to 5.1
## 95 percent confidence interval:
##  4.767487 5.240713
## sample estimates:
## mean of x.totchol 
##            5.0041

Figure 28 shows a histogram of the data with a \(95\%\) confidence interval (red lines), the sample mean (blue line) and the value of \(\mu_{0}\) (green line).

hist(x.totchol,col=0,main="total cholesterol",xlab="total cholesterol")
LL<-4.767487
UL<-5.240713
mu0<-5.1
xbar<-5.0041
lines(c(LL,LL),c(0,50),col=2,lwd=2)
lines(c(UL,UL),c(0,50),col=2,lwd=2)
lines(c(mu0,mu0),c(0,50),col=3,lwd=2)
lines(c(xbar,xbar),c(0,50),col=4,lwd=2)

Figure 28: Histogram of the total cholesterol and a 95% confidence interval for the population mean.

6 Decision error (Type I and Type II error)

6.1 Type I and Type II errors

Hypothesis tests can potentially result in incorrect decisions, such as rejecting the null hypothesis when the null is actually true or not rejecting the null hypothesis when the alternative is correct.

  • Type I error: rejecting the null hypothesis when the null is true.
  • Type II error: failing to reject the null hypothesis when the alternative is true.

6.2 Significance level and power

The probability of making a Type I error is the same as the significance level \(\alpha\), that is

\[ \alpha=P(\mbox{reject}\;H_{0} \; \mbox{when}\;H_{0}\; \mbox{is true}),\]

since \(\alpha\) determines the cutoff point for rejecting the null hypothesis. For example, if \(\alpha\) is chosen to be 0.05, then there is a \(5\%\) chance of incorrectly rejecting \(H_{0}\).

The probability of making a Type II error is denoted by \(\beta\), \[ \beta=P(\mbox{do not reject}\; H_{0}\; \mbox{when}\;H_{A}\; \mbox{is true}).\] The power of the test, \(1-\beta\), is the probability of making a correct decision when \(H_{A}\) is correct,

\[ 1-\beta=P(\mbox{reject}\; H_{0}\; \mbox{when}\;H_{A}\; \mbox{is true}).\]

The rate of Type I error can be reduced by lowering \(\alpha\) (e.g., to 0.01 instead of 0.05); doing so requires an observation to be more extreme to qualify as sufficient evidence against the null hypothesis. However, this inevitably raises the rate of Type II errors (and lowers the power \(1-\beta\)), since the test will now have a higher chance of failing to reject the null hypothesis when the alternative is true.
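The Type I error rate can be illustrated by simulation (our own sketch, not part of the original text): drawing many samples under \(H_{0}\) and recording how often \(H_{0}\) is rejected at \(\alpha=0.05\) should give a rejection rate close to 0.05.

set.seed(2025)
B<-10000
reject<-c(1:B)
for(i in 1:B)
{
x<-rnorm(50,0,1)                        # data generated under H0: mu = 0
z<-(mean(x)-0)/(1/sqrt(50))             # z statistic (sigma = 1 known)
reject[i]<-(2*(1-pnorm(abs(z)))<0.05)   # two-sided test at alpha = 0.05
}
mean(reject)                            # empirical Type I error rate, close to 0.05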