26-02-2025 >eR-BioStat
Inference for numerical data
Ziv Shkedy and Thi Huyen Nguyen, based on Chapter 5 of the book Introductory Statistics for the Life and Biomedical Sciences by Julie Vu and Dave Harrington (https://www.openintro.org/book/biostat/)
## load/install libraries
.libPaths(c("./Rpackages",.libPaths()))
library(knitr)
library(tidyverse)
library(deSolve)
library(minpack.lm)
library(ggpubr)
library(readxl)
library(gamlss)
library(data.table)
library(grid)
library(png)
library(nlme)
library(gridExtra)
library(mvtnorm)
library(e1071)
library(lattice)
library(ggplot2)
library(dslabs)
library(NHANES)
library(plyr)
library(dplyr)
library(nasaweather)
library(gganimate)
library(av)
library(gifski)
library(foreach)
library(DAAG)
library(DT)
library(TeachingDemos)
The tools studied in Chapter 4 all made use of the t-statistic from a sample mean, \[t = \frac{\bar{x}-\mu}{s /\sqrt{n}},\] where the parameter \(\mu\) is a population mean, \(\bar{x}\) and \(s\) are the sample mean and standard deviation, and \(n\) is the sample size. Tests and confidence intervals were restricted to samples of at least 30 independent observations from a population where there was no evidence of strong skewness. This allowed the Central Limit Theorem to be applied, justifying use of the normal distribution to calculate probabilities associated with the \(t\)-statistic. In samples smaller than 30, if the data are approximately symmetric and there are no large outliers, the \(t\)-statistic has what is called a \(t\)-distribution. When the normal distribution is used as the sampling distribution of the \(t\)-statistic, \(s\) is essentially being treated as a good replacement for the unknown population standard deviation \(\sigma\). However, the sample standard deviation \(s\), as an estimate of \(\sigma\), has its own inherent variability, like \(\bar{x}\). The \(t\) density function adjusts for the variability in \(s\) by placing more probability in the left and right tails than the normal distribution.
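The effect can be illustrated with a small simulation (a sketch; the sample size and the number of replications below are arbitrary choices and are not part of the original text):
set.seed(123)                      # arbitrary seed, for reproducibility
n<-5                               # a small sample size
t.stat<-replicate(10000,{
x<-rnorm(n,mean=0,sd=1)            # sample from N(0,1), so mu=0
(mean(x)-0)/(sd(x)/sqrt(n))        # the t-statistic
})
mean(abs(t.stat)>2)                # simulated tail probability, close to 0.116
2*pt(-2,n-1)                       # t-distribution tail: approximately 0.116
2*pnorm(-2)                        # normal tail: approximately 0.046, smaller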
Figure 1 shows a \(t\)-distribution (black line) and a standard normal distribution (red line). Like the standard normal distribution, the \(t\)-distribution is unimodal and symmetric about zero. However, the tails of a \(t\)-distribution are thicker than those of the normal, so observations are more likely to fall beyond two standard deviations from the mean than under the normal distribution. While the estimate of the standard error will be less accurate with smaller sample sizes, the thick tails of the \(t\)-distribution correct for the variability in \(s\).
x<-seq(from=-4,to=4,length=10000)
d.normal<-dnorm(x,0,1)   # N(0,1) density
d.t<-dt(x,3)             # t-density with 3 d.f.
plot(x,d.t,type="l",ylim=c(0,0.5),ylab="f(x)")
lines(x,d.normal,col=2)
Figure 1: Black line: t-distribution with 3 d.f. Red line: N(0,1).
The \(t\)-distribution can be described as a family of symmetric distributions with a single parameter: the degrees of freedom, which equals \(n-1\). Several \(t\)-distributions are shown in Figure 2. When there are more degrees of freedom, the \(t\)-distribution looks very much like the standard normal distribution. With 30 or more degrees of freedom, the \(t\)-distribution is nearly indistinguishable from the normal distribution. Since the \(t\)-statistics in Chapter 4 were associated with sample sizes of at least 30, the degrees of freedom for the corresponding \(t\)-distributions were large enough to justify use of the normal distribution to calculate probabilities.
x<-seq(from=-4,to=4,length=10000)
d.normal<-dnorm(x,0,1)
d.t1<-dt(x,1)    # t-densities with 1, 5, 10 and 50 d.f.
d.t5<-dt(x,5)
d.t10<-dt(x,10)
d.t50<-dt(x,50)
plot(x,d.t1,type="l",ylim=c(0,0.5),ylab="f(x)")
lines(x,d.t5,col=4)
lines(x,d.t10,col=5)
lines(x,d.t50,col=6)
lines(x,d.normal,col=2)
legend(-4,0.5,c("t_(1)","t_(5)","t_(10)","t_(50)","N(0,1)"),lty=c(1,1,1,1,1),
col=c(1,4,5,6,2))
Figure 2: t-distribution with different d.f. and N(0,1).
The degrees of freedom characterize the shape of the t-distribution. The larger the degrees of freedom, the more closely the distribution approximates the normal model.
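As a quick numerical check (a sketch using the R functions qt() and qnorm(), discussed below), the upper \(2.5\%\) cutoffs of the \(t\)-distribution approach the corresponding normal cutoff as the degrees of freedom increase:
qt(0.975,df=c(5,10,30,100))   # approximately 2.571, 2.228, 2.042, 1.984
qnorm(0.975)                  # approximately 1.960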
Probabilities for the t-distribution can be calculated either by using distribution tables or by using statistical software. In this book we focus on R. The value that identifies the cutoff for an upper tail of \(5\%\) can be found using the R function qt(). For a \(t\)-distribution with 18 d.f. we use qt(0.95,18). This cutoff is 1.73.
df<-18
q.95<-qt(0.95,df)
q.95
## [1] 1.734064
The cutoff for the lower \(5\%\) is -1.73; just like the normal distribution, all \(t\)-distributions are symmetric. If the area in each tail is \(5\%\), then the area in the two tails, as illustrated in Figure 3, is \(10\%\).
x<-seq(from=-4,to=4,length=10000)
q.95<-qt(0.95,18)
q.05<-qt(0.05,18)
d.t18<-dt(x,18)
plot(x,d.t18,type="l",ylim=c(0,0.5),ylab="f(x)")
lines(c(q.95,q.95),c(0,0.5),col=2)
lines(c(q.05,q.05),c(0,0.5),col=2)
text(2.25,0.02,"0.05")
text(-2.25,0.02,"0.05")
text(0,0.02,"0.9")
Figure 3: Quantiles in a t-distribution with d.f=18.
What proportion of the \(t\)-distribution with 18 degrees of freedom falls below -2.10? Just like for a normal probability problem, it is advisable to start by drawing the distribution and shading the area below -2.10, as shown in Figure 4. In R the probability can be calculated using the function pt() in the following way:
df<-18
p.210<-pt(-2.10,df)
p.210
## [1] 0.0250452
About 2.5% of the distribution falls below -2.10.
x<-seq(from=-4,to=4,length=10000)
p.210<-pt(-2.10,18)
d.t18<-dt(x,18)
plot(x,d.t18,type="l",ylim=c(0,0.5),ylab="f(x)")
lines(c(-2.10,-2.10),c(0,0.475),col=2)
text(-2.18,0.5,"-2.10")
Figure 4: Probabilities in a t-distribution with d.f=18.
A \(t\)-distribution with 18 degrees of freedom is shown in Figure 5. We would like to estimate the proportion of the distribution falling above 1.65 and below -1.65.
x<-seq(from=-4,to=4,length=10000)
d.t18<-dt(x,18)
plot(x,d.t18,type="l",ylim=c(0,0.5),ylab="f(x)")
lines(c(-1.65,-1.65),c(0,0.5),col=2)
lines(c(1.65,1.65),c(0,0.5),col=2)
Figure 5: Probabilities in a t-distribution with d.f=18.
In the lower tail, \(P(T \le -1.65)=0.0581\). Hence, by symmetry, the probability that \(T \le -1.65\) or \(T \ge 1.65\) is equal to \(2 \times 0.0581=0.1162\).
df<-18
p.165<-pt(-1.65,df)
p.165
## [1] 0.05814186
2*p.165
## [1] 0.1162837
Chapter 4 provided formulas for tests and confidence intervals for population means in random samples large enough for the \(t\)-statistic to have a nearly normal distribution. In samples smaller than 30 from approximately symmetric distributions without large outliers, the \(t\)-statistic has a \(t\)-distribution with degrees of freedom equal to \(n-1\). Just like inference in larger samples, inference using the \(t\)-distribution also requires that the observations in the sample be independent. Random samples from very large populations always produce independent observations; in smaller populations, observations will be approximately independent as long as the size of the sample is no larger than \(10\%\) of the population.
Formulas for confidence intervals using the \(t\)-distribution are very similar to those using the normal distribution. For a sample of size \(n\) with sample mean \(\bar{x}\) and standard deviation \(s\), two-sided confidence intervals with confidence coefficient \(100(1-\alpha)\%\) have the form \[ \bar{x} \pm t^{*}_{df}\times SE, \]
where \(SE\) is the standard error of the sample mean \((s/\sqrt{n})\) and \(t^{*}_{df}\) is the point on a \(t\)-distribution with \((n-1)\) degrees of freedom and area \((1-\frac{\alpha}{2})\) to its left. For a given sample size and \(\alpha\), the value of \(t^{*}_{df}\) can be calculated using the R function qt().
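For example, a minimal sketch for a two-sided \(90\%\) confidence interval (\(\alpha=0.10\)) based on a sample of size \(n=25\) (both values are arbitrary choices for illustration):
alpha<-0.10              # confidence coefficient of 90%
n<-25                    # sample size
qt(1-alpha/2,df=n-1)     # critical value, approximately 1.711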
The women data set gives the average heights and weights of 15 American women aged 30–39 and is available as a data frame in R (the R object women).
dim(women)
## [1] 15 2
print(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
In this section we focus on the women’s height (the variable height). The sample mean is \(\bar{x}=65\) and the standard error of the sample mean is equal to 1.1547.
womenheight=women$height
x.bar<-mean(womenheight)
x.bar
## [1] 65
se.x<-sd(womenheight)/sqrt(15)
se.x
## [1] 1.154701
Figure 6 presents a boxplot of the women’s height.
ggplot(women, aes(x = "", y = height)) +
geom_boxplot(fill = "lightblue")+
geom_jitter(aes(x = "", y = height), color = "black", size = 1, alpha = 0.5)+
xlab("")
Figure 6: Boxplot for the height in the women dataset.
Since \(n=15\), the critical value, which is the \(97.5\%\) quantile of \(t_{(14)}\) (shown in Figure 7), is equal to 2.1447.
tval<-qt(0.975,14)
xx<-seq(from=-4,to=4,length=10000)
dx<-dt(xx,14)
plot(xx,dx,type="l",ylab="f(x)")
lines(c(tval,tval),c(0,0.35),col=2,lwd=2)
text(tval,0.375, "2.144")
Figure 7: t-distribution with df=14 and the critical value.
A \(95\%\) confidence interval for the population mean is given by
\[ \bar{x} \pm t^{*}_{14} \times SE = 65 \pm 2.1447 \times 1.1547. \]
For the women data the confidence interval is \((62.52341,\ 67.47659)\).
L<-x.bar-tval*se.x
U<-x.bar+tval*se.x
c(L,U)
## [1] 62.52341 67.47659
Note that the confidence interval is symmetric around \(\bar{x}\), and we can conclude, with \(95\%\) confidence, that the true mean of the women’s height is between 62.52 and 67.48.
For a population with mean \(\mu\), we formulate the following null hypothesis \(H_{0}: \mu=\mu_{0}\) versus a one-sided alternative \(H_{1}:\mu > \mu_{0}\). As in Chapter 4, the test statistic is \[ t=\frac{\bar{x}-\mu_{0}}{s/\sqrt{n}}. \] For a large sample size (\(n>30\)), or for a variable that follows a normal distribution, the test statistic follows, under \(H_{0}\), a \(t\)-distribution with \(n-1\) degrees of freedom, \(t \sim t_{(n-1)}\). The null hypothesis is rejected if \(t > t^{*}_{df}\).
Suppose that we wish to test the null hypothesis \(H_{0}:\mu=64\) versus the alternative \(H_{1}:\mu > 64\). The test statistic for this case is \[ t=\frac{\bar{x}-\mu_{0}}{s/ \sqrt{n}}=0.8660254 \]
mu0<-64
test.stat<-(x.bar-mu0)/se.x
test.stat
## [1] 0.8660254
For a one-sided test and \(\alpha=0.05\) the critical value is equal to 1.7613. Since \(t=0.8660254<1.7613\) we cannot reject \(H_{0}\). The p-value for this example is 0.2 (the area in Figure 8 to the right of the test statistic).
tval<-qt(0.95,14)
plot(xx,dx,type="l",xlab="x",ylab="f(x)")
lines(c(tval,tval),c(0,0.35),col=2,lwd=2)
lines(c(test.stat,test.stat),c(0,0.35),col=3,lwd=2)
text(tval,0.375,paste(round(tval,digits=3)))
Figure 8: t-distribution with df=14 and the critical value for one sided test (red line) and the observed statistic (green line).
pval<-1-pt(test.stat,14)
pval
## [1] 0.2005372
The analysis presented above can be conducted using the R function t.test(). The argument mu=64 is used to specify the value of \(\mu_{0}=64\).
t.test(womenheight,mu=64,conf.level=0.95)
##
## One Sample t-test
##
## data: womenheight
## t = 0.86603, df = 14, p-value = 0.4011
## alternative hypothesis: true mean is not equal to 64
## 95 percent confidence interval:
## 62.52341 67.47659
## sample estimates:
## mean of x
## 65
Note that the output above was produced for the two-sided alternative \(H_{1}:\mu \ne 64\). For a one-sided test we specify alternative=c("greater"). In this case, a one-sided confidence bound is reported instead of a two-sided interval.
t.test(womenheight,mu=64,conf.level=0.95,alternative=c("greater"))
##
## One Sample t-test
##
## data: womenheight
## t = 0.86603, df = 14, p-value = 0.2005
## alternative hypothesis: true mean is greater than 64
## 95 percent confidence interval:
## 62.96621 Inf
## sample estimates:
## mean of x
## 65
The airquality dataset is an R object that gives 153 daily air quality measurements (\(n=153\)) in New York, May to September 1973.
dim(airquality)
## [1] 153 6
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
The variable of primary interest, Wind, is the average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport. We use the R package ggplot2 to explore the data. Figure 9 shows a histogram and a density estimate of the wind speed.
ggplot(airquality, aes(x = Wind)) +
geom_histogram(aes(y=..density..),fill = "skyblue", color = "black")+
geom_density(alpha=0.3,fill="orange")+
ylab("Density")
Figure 9: Histogram of wind speed.
The sample mean (\(\bar{x}\)) and standard deviation (\(s\)) of the wind speed are 9.957516 and 3.523001, respectively.
wind<-airquality$Wind
M.wind<-mean(wind)
SD.wind<-sqrt(var(wind))
M.wind
## [1] 9.957516
SD.wind
## [1] 3.523001
We construct a \(95\%\) confidence interval for the population mean using the R function t.test().
wind<-airquality$Wind
t.test(wind)
##
## One Sample t-test
##
## data: wind
## t = 34.961, df = 152, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 9.394804 10.520229
## sample estimates:
## mean of x
## 9.957516
A \(95\%\) confidence interval for the wind speed is \((9.394804,\ 10.520229)\). This means that we are \(95\%\) confident that the true average wind speed lies within this range.
Testing the hypothesis that the mean wind speed is equal to 9 versus a one-sided alternative at the significance level of 0.05 can be formulated as:
\[H_{0}:\mu=9\;\;\mbox{Vs.}\;\; H_{1}:\mu > 9.\]
We use the t.test() function and specify mu=9 and alternative=c("greater").
t.test(wind,mu=9,alternative=c("greater"))
##
## One Sample t-test
##
## data: wind
## t = 3.3619, df = 152, p-value = 0.0004897
## alternative hypothesis: true mean is greater than 9
## 95 percent confidence interval:
## 9.48616 Inf
## sample estimates:
## mean of x
## 9.957516
Since the p-value = 0.0004897 is smaller than \(\alpha = 0.05\), there is sufficient evidence to conclude that the mean wind speed is greater than 9.
In the 2000 Olympics, was the use of a new wetsuit design responsible for an observed increase in swim velocities? In a study designed to investigate this question, twelve competitive swimmers swam 1500 meters at maximal speed, once wearing a wetsuit and once wearing a regular swimsuit. The order of wetsuit versus swimsuit was randomized for each of the 12 swimmers. The panel below presents the average velocity recorded for each swimmer, measured in meters per second (m/s).
swimmer<-c(1:12)
w.suit.vel<-c(1.57,1.47,1.42,1.35,1.22,1.75,1.64,1.57,1.56,1.53,1.49,1.51)
s.suit.vel<-c(1.49,1.37,1.35,1.27,1.12,1.64,1.59,1.52,1.50,1.45,1.44,1.41)
diff<-w.suit.vel-s.suit.vel
data.frame(swimmer,w.suit.vel,s.suit.vel)
## swimmer w.suit.vel s.suit.vel
## 1 1 1.57 1.49
## 2 2 1.47 1.37
## 3 3 1.42 1.35
## 4 4 1.35 1.27
## 5 5 1.22 1.12
## 6 6 1.75 1.64
## 7 7 1.64 1.59
## 8 8 1.57 1.52
## 9 9 1.56 1.50
## 10 10 1.53 1.45
## 11 11 1.49 1.44
## 12 12 1.51 1.41
The swimsuit velocity data are an example of paired data, in which two sets of observations are uniquely paired so that an observation in one set matches an observation in the other; in this case, each swimmer has two measured velocities, one with a wetsuit (the variable w.suit.vel) and one with a swimsuit (the variable s.suit.vel). A natural measure of the effect of the wet suit on swim velocity is the difference between the measured maximum velocities (velocity.diff = wet.suit.velocity - swim.suit.velocity). Even though there are two measurements per swimmer, using the difference in velocities as the variable of interest allows for the problem to be approached like those in Section XXX. Although it was not explicitly noted, the data used in Section XXX were paired; each respondent had both an actual and desired weight. Figure 10 shows the difference between the measured maximum velocities by swimmer. The sample mean for the difference is equal to 0.0775 (the green line in Figure 10).
plot(swimmer,diff,ylim=c(-0.05,0.11))
abline(0,0,col=2)
lines(c(0,13),c(mean(diff),mean(diff)),lwd=2,col=3)
Figure 10: A dot plot of differences in swim velocities.
We define two variables: \(X_{1i}\), the velocity with a wetsuit, and \(X_{2i}\), the velocity with a swimsuit. Let \(D_{i}\) be the difference between the velocities, \[ D_{i}=X_{1i}-X_{2i}. \] Suppose the parameter \(\delta\) is the population average of the difference in maximum velocities during a 1500m swim if all competitive swimmers recorded swim velocities with each suit type, that is, \[ \delta=E(D_{i})=E(X_{1i}-X_{2i}). \]
The parameter \(\delta\) is the population mean of the differences, while \(\bar{d}\), the sample mean of the differences, is its estimate.
A hypothesis test can then be conducted with the null hypothesis that the mean population difference in swim velocities between suit types equals 0 (i.e., there is no difference in population average swim velocities), \(H_{0}:\delta=0\), against the alternative that the difference is non-zero, \(H_{1}:\delta \ne 0\). Some important assumptions are being made. First, it is assumed that the data are a random sample from the population. While the observations are likely independent, it is more difficult to justify that this sample of 12 swimmers is randomly drawn from the entire population of competitive swimmers. Nevertheless, it is often assumed in problems such as these that the participants are reasonably representative of competitive swimmers. Second, it is assumed that the population of differences is normally distributed. This is a small sample, one in which normality would be difficult to confirm. The dot plot for the difference in velocities in Figure 10 shows approximate symmetry around the sample mean (the green line).
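Since normality of the differences is hard to confirm with only 12 pairs, an informal graphical check can complement the dot plot (a sketch; shapiro.test() is a formal normality test available in base R and is not used in the original text):
qqnorm(diff)            # normal quantile-quantile plot of the differences
qqline(diff,col=2)      # points close to the line suggest approximate normality
shapiro.test(diff)      # Shapiro-Wilk test; has low power with n=12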
Let \(\bar{d}\) denote the sample average of the differences in maximum velocity, \(s_{d}\) the sample standard deviation of the differences, and \(n\) the number of pairs in the dataset.
swimmer<-c(1:12)
diff<-w.suit.vel-s.suit.vel
mean(diff)
## [1] 0.0775
sqrt(var(diff))
## [1] 0.02179449
We consider the following test statistic to test the null hypothesis \(H_{0}:\delta=0\):
\[ t=\frac{\bar{d}-\delta_{0}}{s_{d}/\sqrt{n}}. \] In our example
\[ t=\frac{0.0775-0}{0.0218/\sqrt{12}}=12.32. \]
In R we can calculate the test statistic in the following way:
n<-length(swimmer)
diff<-w.suit.vel-s.suit.vel
dbar<-mean(diff)
s.d<-sqrt(var(diff))
t.d<-(dbar-0)/(s.d/sqrt(n))
t.d
## [1] 12.31815
Under \(H_{0}\), the test statistic follows a \(t\)-distribution with \(12-1=11\) degrees of freedom. The two-sided p-value is given by
p.value<-2*(1-pt(12.32,11))
p.value
## [1] 8.87167e-08
The data support the claim that the wetsuits changed swim velocity in a 1500m swim. The observed average increase of 0.078 m/s is significantly different from the null hypothesis of no change, and suggests that swim velocities are higher when swimmers wear wetsuits as opposed to swimsuits.
We can conduct the paired t-test discussed above using the R function t.test(). A general call of the function for a paired t-test has the form t.test(x1,x2,paired=TRUE). For the 2000 Olympics data we have
w.suit.vel<-c(1.57,1.47,1.42,1.35,1.22,1.75,1.64,1.57,1.56,1.53,1.49,1.51)
s.suit.vel<-c(1.49,1.37,1.35,1.27,1.12,1.64,1.59,1.52,1.50,1.45,1.44,1.41)
t.test(w.suit.vel,s.suit.vel,
alternative = c("two.sided"),
mu = 0, paired = TRUE, var.equal = FALSE,
conf.level = 0.95)
##
## Paired t-test
##
## data: w.suit.vel and s.suit.vel
## t = 12.318, df = 11, p-value = 8.885e-08
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.06365244 0.09134756
## sample estimates:
## mean difference
## 0.0775
Calculating confidence intervals for paired data is also based on the differences between the values in each pair; the same approach as for single-sample data can be applied to the differences. For example, a two-sided 95% confidence interval for paired data has the form:
\[ \left ( \bar{d}-t^{*}_{df}\times \frac{s_{d}}{\sqrt{n}},\;\bar{d}+t^{*}_{df}\times \frac{s_{d}}{\sqrt{n}} \right ), \]
where \(t^{*}_{df}\) is the point on a \(t\)-distribution with \(df = n-1\) for \(n\) pairs, with area 0.025 to its right (for a \(95\%\) confidence level). In R, a \(95\%\) confidence interval can be calculated using the argument conf.level = 0.95 in the function t.test().
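The interval can also be computed directly from the formula, using the quantities \(\bar{d}\) and \(s_{d}\) calculated earlier (a sketch; it reproduces the interval reported by t.test() below):
n<-length(diff)                       # number of pairs
t.star<-qt(0.975,df=n-1)              # critical value for a 95% confidence level
se.d<-s.d/sqrt(n)                     # standard error of the mean difference
c(dbar-t.star*se.d,dbar+t.star*se.d)  # (0.06365244, 0.09134756), as in t.test()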
t.test(w.suit.vel,s.suit.vel,
alternative = c("two.sided"),
mu = 0, paired = TRUE, var.equal = FALSE,
conf.level = 0.95)
##
## Paired t-test
##
## data: w.suit.vel and s.suit.vel
## t = 12.318, df = 11, p-value = 8.885e-08
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.06365244 0.09134756
## sample estimates:
## mean difference
## 0.0775
We consider an experiment in which we wish to draw inference about the difference in two population means, \(\mu_{1}\) and \(\mu_{2}\), when the data are independent, i.e., not paired. The point estimate of the difference, \(\bar{x}_{1}-\bar{x}_{2}\), is used to calculate a \(t\)-statistic that is the basis of confidence intervals and tests.
A confidence interval for a difference of two means has the same basic structure as previously discussed confidence intervals:
\[ (\bar{x}_{1}-\bar{x}_{2}) \pm t^{*}_{df} \times SE. \]
Here, \(SE\) is the standard error of \((\bar{x}_{1}-\bar{x}_{2})\). Since \(\sigma_{1}\) and \(\sigma_{2}\) are typically unknown, the standard error is estimated by using \(s_{1}\) and \(s_{2}\) in their place, \[ SE=SE_{(\bar{x}_{1}-\bar{x}_{2})}=\sqrt{\frac{s^{2}_{1}}{n_{1}}+\frac{s^{2}_{2}}{n_{2}}}. \] Here, \(s^{2}_{1}\) and \(s^{2}_{2}\) are the estimates of \(\sigma^{2}_{1}\) and \(\sigma^{2}_{2}\), respectively.
For the case in which \(\sigma_{1}=\sigma_{2}=\sigma\) but unknown, the standard error can be rewritten as \[ SE=SE_{(\bar{x}_{1}-\bar{x}_{2})}=\sqrt{\frac{s^{2}_{p}}{n_{1}}+\frac{s^{2}_{p}}{n_{2}}}. \] This is equal to \[ SE=SE_{(\bar{x}_{1}-\bar{x}_{2})}=s_{p}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}. \] Here, \(s^{2}_{p}\) is the pooled variance estimate, given by
\[ s^{2}_{p}=\frac{\sum_{i=1}^{n_{1}} \left ( X_{1i}-\bar{X}_{1} \right )^{2}+\sum_{i=1}^{n_{2}} \left ( X_{2i}-\bar{X}_{2} \right )^{2}}{n_{1}+n_{2}-2}=\frac{(n_{1}-1) \times s_{1}^{2} + (n_{2}-1) \times s_{2}^{2}}{n_{1}+n_{2}-2}. \]
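A direct translation of this formula into R might look as follows (a sketch; pooled.var() is a hypothetical helper, since base R has no built-in pooled-variance function):
# pooled variance of two samples x and y (hypothetical helper)
pooled.var<-function(x,y){
n1<-length(x)
n2<-length(y)
((n1-1)*var(x)+(n2-1)*var(y))/(n1+n2-2)
}
The same computation is carried out inline for the chickwts example below.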
The chickwts dataset gives information about an experiment that was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. In this section we focus on two diet groups: horsebean and linseed. Let \(\mu_{H}\) and \(\mu_{L}\) be the population means for the horsebean and linseed groups, respectively.
x<-chickwts$weight[chickwts$feed=="horsebean"]
y<-chickwts$weight[chickwts$feed=="linseed"]
The estimates of the population means are \(\bar{x}_{H}=160.2\; (n_{1}=10)\) and \(\bar{y}_{L}=218.75\;(n_{2}=12)\) for the horsebean and linseed groups, respectively.
mean(x)
## [1] 160.2
n1<-length(x)
n1
## [1] 10
mean(y)
## [1] 218.75
n2<-length(y)
n2
## [1] 12
Figure 11 shows a violin plot of the chick weights by diet.
chickwts1<-chickwts[1:22,]   # rows 1-22: the horsebean and linseed groups
ggplot(chickwts1, aes(x = feed, y = weight,fill=feed)) +
geom_violin()+
ylab("The chick weight")+
geom_jitter(aes(x = feed, y = weight), color = "black", size = 1, alpha = 0.5)+
xlab("")
Figure 11: A violin plot of the chicken weight by diet group.
The pooled sample variance is
Sp<-((n1-1)*var(x)+(n2-1)*var(y))/(n1+n2-2)
Sp
## [1] 2172.092
The test statistic for comparing the two population means is \[ t=\frac{(\bar{x}_{1}-\bar{x}_{2})-(\mu_{1}-\mu_{2})_{H_{0}}}{s_{p}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}}. \] Here, \((\mu_{1}-\mu_{2})_{H_{0}}\) is the mean difference in the population under the null hypothesis. When the null hypothesis states that \(\mu_{1}=\mu_{2}\), the term \((\mu_{1}-\mu_{2})_{H_{0}}=0\). Under the null hypothesis, \(t \sim t_{(n_{1}+n_{2}-2)}\).
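With the pooled variance Sp computed above, the test statistic for the chickwts comparison can be evaluated directly (a sketch; it reproduces the t.test() output shown below):
sp<-sqrt(Sp)                    # pooled standard deviation
t.stat<-(mean(x)-mean(y))/(sp*sqrt(1/n1+1/n2))
t.stat                          # approximately -2.934
2*pt(-abs(t.stat),df=n1+n2-2)   # two-sided p-value, approximately 0.0082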
For two independent samples, a general call of the function t.test() has the form t.test(x,y,var.equal=T). The argument var.equal=T implies that we conduct the test under the assumption that \(\sigma_{1}=\sigma_{2}\). By default, the function tests the null hypothesis \(\mu_{x}=\mu_{y}\) against a two-sided alternative.
t.test(x,y,var.equal=T)
##
## Two Sample t-test
##
## data: x and y
## t = -2.934, df = 20, p-value = 0.008205
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -100.17618 -16.92382
## sample estimates:
## mean of x mean of y
## 160.20 218.75
When the variances in the two populations are not equal, we estimate the standard error of the difference between the sample means by \[ SE=SE_{(\bar{x}_{1}-\bar{x}_{2})}=\sqrt{\frac{s^{2}_{1}}{n_{1}}+\frac{s^{2}_{2}}{n_{2}}}. \] In this case, the test statistic does not follow a \(t\)-distribution with \(n_{1}+n_{2}-2\) degrees of freedom. To conduct the test using the function t.test() we specify var.equal=F. The function will conduct a "Welch Two Sample t-test", for which the degrees of freedom are different; as a result, the value of the test statistic, the p-value and the confidence interval change compared to the two-sample t-test in the previous section. Note that the degrees of freedom are equal to 19.769, compared with \(10+12-2=20\) in the previous section.
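The degrees of freedom reported by the Welch test can be reproduced with the Welch–Satterthwaite approximation (a sketch, using the standard formula):
v1<-var(x)/n1                        # estimated variance of the first sample mean
v2<-var(y)/n2                        # estimated variance of the second sample mean
(v1+v2)^2/(v1^2/(n1-1)+v2^2/(n2-1))  # approximately 19.769, as in t.test() below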
t.test(x,y,var.equal=F)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -3.0172, df = 19.769, p-value = 0.006869
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -99.0597 -18.0403
## sample estimates:
## mean of x mean of y
## 160.20 218.75
The NHANES dataset consists of data from the US National Health and Nutrition Examination Study. Information about 76 variables is available for 10000 individuals included in the study. The 10000 individuals are considered as the population. In this section we focus on the total cholesterol level (the variable TotChol). After omitting all individuals with missing values, 8474 individuals are included in the analysis. The population parameters are equal to \(\mu=4.879\) and \(\sigma=1.075\).
library(NHANES)
data(NHANES)
dim(NHANES)
## [1] 10000 76
#names(NHANES)
totchol<-na.omit(NHANES$TotChol)
length(totchol)
## [1] 8474
mean(totchol)
## [1] 4.87922
sqrt(var(totchol))
## [1] 1.075583
Our main interest is to compare the total cholesterol level between females and males. Overall, 5020 individuals in the population (\(50.2\%\)) are female.
table(NHANES$Gender)
##
## female male
## 5020 4980
Let \(\mu_{F}\) and \(\mu_{M}\) be the mean total cholesterol level for females and males, respectively. Formally, we wish to test the hypotheses
\[ \begin{array}{l} H_{0}:\mu_{F}=\mu_{M}, \\ H_{A}:\mu_{F} \ne \mu_{M}. \end{array} \]
Figure 12 presents the distribution of the total cholesterol level and indicates that the population distribution has a long right tail.
ggplot(NHANES, aes(x = TotChol)) +
geom_histogram(aes(y = ..density..), fill = "skyblue", color = "black") +
geom_density(alpha = 0.2, fill = "orange")+
ylab("Density")+
xlab("The total cholesterol level")
Figure 12: Histogram with density plot of the total cholesterol level.
A boxplot for the total cholesterol level by gender is shown in Figure 13. Our main interest in this section is to estimate the mean difference, to construct a \(95\%\) confidence interval for the mean difference \(\mu_{F}-\mu_{M}\), and to test whether \(\mu_{F}=\mu_{M}\).
ggplot(NHANES, aes(x = Gender, y = TotChol,fill=Gender)) +
geom_boxplot()+
ylab("The total cholestrol level")+
geom_jitter(aes(x = Gender, y = TotChol), color = "black", size = 0.1, alpha = 0.5)+
xlab("")
Figure 13: Box plot of the total cholesterol level by gender.
Suppose that the population parameters are unknown and it is not possible to calculate them from the population. We draw a sample of 100 individuals and base our inference and estimation on the sample. Among the 100 individuals that were sampled, some have missing values for the total cholesterol level. After removing missing values, the sample size is reduced to 84. The overall mean and variance are \(\bar{x}=4.890714\) and \(s^{2}=1.122269\), while in the population \(\mu=4.87922\) and \(\sigma^{2}=1.075583^{2}=1.156879\). This implies that the estimated variance of the sample mean is equal to \(1.122269/84=0.01336\).
set.seed(101112)
#dim(NHANES)
x.totchol<-sample(c(1:10000),size=100,replace=FALSE)
NHANES1<-NHANES[x.totchol,]
#dim(NHANES1)
#table(NHANES1$Gender)
#names(NHANES1)
#print(NHANES1$Gender)
As mentioned above, after removing missing values, the sample size is reduced to 84, of whom 43 are female.
Gender1<-NHANES1$Gender[!is.na(NHANES1$TotChol)]
Tot.Chol1<-na.omit(NHANES1$TotChol)
mean(Tot.Chol1)
## [1] 4.890714
var(Tot.Chol1)
## [1] 1.122269
NHANES2<-data.frame(Gender1,Tot.Chol1)
table(Gender1)
## Gender1
## female male
## 43 41
For the sample we obtain \(\bar{x}_{F}=5.106512\) and \(\bar{x}_{M}=4.664390\). Figure 14 shows the boxplot for the total cholesterol level by gender.
tapply(Tot.Chol1,Gender1,mean)
## female male
## 5.106512 4.664390
ggplot(NHANES1, aes(x = Gender, y = TotChol,fill=Gender)) +
geom_boxplot()+
ylab("The total cholestrol level")+
geom_jitter(aes(x = Gender, y = TotChol), color = "black", size = 0.7, alpha = 0.5)+
xlab("")
Figure 14: Box plot of the total cholesterol level by gender in the sample.
Figure 15 presents a density plot of the total cholesterol level by gender and reveals a clear shift of the distribution. The question is whether this shift represents a significant difference between the cholesterol levels of females and males.
library(ggridges)
ggplot(NHANES1, aes(x=TotChol,y=Gender,fill = Gender)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
Figure 15: Density plot of the total cholesterol level by gender.
For a two independent samples t-test, we use Tot.Chol1~Gender1 to specify the null hypothesis \(H_{0}:\mu_{F}=\mu_{M}\). The continuous variable of interest is the total cholesterol level (the variable Tot.Chol1) and the factor variable is gender (the variable Gender1). Note that we assume that \(\sigma_{F}=\sigma_{M}\) and therefore use the argument var.equal=TRUE. By default, the alternative hypothesis is two-sided. The \(95\%\) confidence interval, \((-0.01037,\ 0.89462)\), indicates that, for \(\alpha=0.05\), the null hypothesis cannot be rejected, since the confidence interval covers the value of zero; indeed, the p-value is equal to 0.05536. However, at a significance level of \(10\%\), p-value \(=0.05536<0.1\) and we reject the null hypothesis.
t.test(Tot.Chol1~Gender1,data=NHANES2,var.equal=TRUE)
##
## Two Sample t-test
##
## data: Tot.Chol1 by Gender1
## t = 1.9437, df = 82, p-value = 0.05536
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -0.01037757 0.89462034
## sample estimates:
## mean in group female mean in group male
## 5.106512 4.664390
Note that a \(90\%\) confidence interval for the mean difference, \((0.06370,\ 0.82054)\), does not cover the value of zero, indicating that, for \(\alpha=0.1\), the null hypothesis is rejected.
t.test(Tot.Chol1~Gender1,data=NHANES2,var.equal=TRUE,conf.level = 0.9)
##
## Two Sample t-test
##
## data: Tot.Chol1 by Gender1
## t = 1.9437, df = 82, p-value = 0.05536
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 90 percent confidence interval:
## 0.06370058 0.82054218
## sample estimates:
## mean in group female mean in group male
## 5.106512 4.664390