Introduction and Method
As part of the NHANES study, the triglyceride levels of 3026 adult women were measured. Triglycerides, the main constituent of both vegetable and animal fat, have been linked to atherosclerosis, heart disease, and stroke. Let’s treat this whole group of women as our population and take a small sample, say of 25 women, from it. We will compare the distribution of triglycerides in our population and in the sample.
lipids <- read.delim("http://myweb.uiowa.edu/pbreheny/data/lipids.txt")
pop <- lipids$TRG
sam <- sample(pop, 25)
hist(pop, col="gray", border="white", breaks=seq(0, 400, length=99), main= "population distribution of triglyceride levels", xlab = "mg/dL")
summary(pop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.0 68.0 98.0 116.9 147.0 399.0
Population distribution
The distribution of triglyceride levels for the 3026 women is unimodal and skewed right. The center of the distribution (the median) is 98 mg/dL. The middle 50% of women have levels between 68 mg/dL and 147 mg/dL. Normal triglyceride levels are below 150 mg/dL, so almost 25% of the population has high levels of triglycerides.
hist(sam, col="pink", border= "white", breaks= seq(0, 400, length=10), main= "Distribution of single sample of 25 women", xlab= "mg/dL")
summary(sam)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.0 90.0 101.0 131.7 146.0 337.0
The single sample of women from the population is unimodal and skewed right. The center (median) is at 101 mg/dL and the range is from 34 to 337 mg/dL. According to the outlier rule, there is at least one high outlier. In addition, there is a gap in the data, making 4 measurements in the sample seem extreme.
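The outlier rule mentioned above is presumably the usual 1.5 × IQR fence rule; that reading is an assumption, since the text does not spell it out. A minimal sketch that applies it to the five-number summary printed for this sample (the variable names are illustrative):

```r
# Quartiles and extremes copied from the summary(sam) output above
q1 <- 90
q3 <- 146
sam_min <- 34
sam_max <- 337

# Assumed rule: points beyond 1.5 * IQR from the quartiles are flagged
iqr_sam <- q3 - q1
lower_fence <- q1 - 1.5 * iqr_sam
upper_fence <- q3 + 1.5 * iqr_sam

has_low_outlier  <- sam_min < lower_fence
has_high_outlier <- sam_max > upper_fence
c(lower_fence = lower_fence, upper_fence = upper_fence)
```

The upper fence works out to 230 mg/dL, so the maximum of 337 mg/dL is flagged, while the minimum of 34 mg/dL sits above the lower fence of 6 mg/dL.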
mean(pop)
## [1] 116.9451
mean(sam)
## [1] 131.72
The mean of the sample (131.72 mg/dL) is higher than that of the population (116.95 mg/dL).
It is worth noting that (a) the distribution of triglycerides in the population is clearly right skewed, (b) the sample looks representative of the population, as it should because it was drawn at random, and (c) the sample and population means are close, but the sample mean is clearly off a bit as an estimate of the population mean.
This is just one sample; the means of other random samples might be much further from or closer to the population mean. To see that distribution we’ll have to repeat the sampling process many times and record each sample mean.
TRGmeans <-rep(0,100)
for (i in 1:100) {
sam <- sample(pop, 25)
TRGmeans[i] <-mean(sam)
}
hist(TRGmeans, col= "gray", border = "white", breaks = seq(0,400, length = 99), main = "Distribution of 100 sample means", xlab = "sample mean (mg/dL)")
summary(TRGmeans)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 90.56 106.52 114.54 116.95 127.14 157.16
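Because sample() draws at random, every run of the code above gives different numbers. One way to make the results repeatable is to fix the random seed, and replicate() can stand in for the explicit loop. The sketch below is not the original analysis: it uses a simulated right-skewed population (pop_sim, an assumption) so that it runs without the data file, and the helper name draw_means and the seed values are hypothetical.

```r
# Hypothetical helper: draw `reps` samples of size `n` and return their means
draw_means <- function(pop, n = 25, reps = 100, seed = 1) {
  set.seed(seed)  # fixing the seed makes the random draws repeatable
  replicate(reps, mean(sample(pop, n)))
}

# Stand-in population (assumption): right-skewed, mean near 117
set.seed(42)
pop_sim <- rgamma(3000, shape = 3, scale = 39)

means_a <- draw_means(pop_sim)
means_b <- draw_means(pop_sim)
identical(means_a, means_b)  # same seed, same samples
summary(means_a)
```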
The distribution of mean triglyceride levels created from 100 samples of 25 randomly selected women in the NHANES study is approximately Normal with a mean of about 117 mg/dL.
The distribution of 100 sample means is unimodal and approximately symmetric without any outliers. Therefore, we can model the distribution as Normal, with a mean equal to the population mean of 117 mg/dL and a standard error equal to the population standard deviation of 68 mg/dL divided by the square root of the sample size (n = 25). This computation is shown in the code below.
SD<-sd(pop)
SE <- SD/sqrt(25)
sd(TRGmeans)
## [1] 15.04068
print(SE)
## [1] 13.58864
mean(TRGmeans)
## [1] 116.9452
The sampling distribution of the sample means of triglyceride levels is approximately Normal with a mean of about 117 mg/dL and a standard error of 13.6 mg/dL.
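The simulated spread, sd(TRGmeans) = 15.04, is not identical to the theoretical standard error of 13.59, which is to be expected with only 100 sample means. A rough sketch of how the gap tends to shrink with more repetitions, using a simulated stand-in population (pop_sim, the seed, and the repetition counts are assumptions, not part of the original analysis):

```r
set.seed(2024)  # arbitrary seed
pop_sim <- rgamma(3000, shape = 3, scale = 39)  # stand-in, right-skewed
n <- 25
se_theory <- sd(pop_sim) / sqrt(n)

# Empirical standard deviation of sample means for a given number of repetitions
sd_of_means <- function(reps) sd(replicate(reps, mean(sample(pop_sim, n))))

sd_100   <- sd_of_means(100)
sd_10000 <- sd_of_means(10000)
c(theory = se_theory, reps_100 = sd_100, reps_10000 = sd_10000)
```

Sampling without replacement from a finite population also shrinks the true spread slightly, but with 25 women drawn from about 3000 that correction is well under 1%.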
Adapting the z-score formula to the sampling distribution of the means gives the following code.
z<- (TRGmeans - mean(pop))/SE
hist(z, col="gray", border="white", freq=FALSE, breaks = 15, main="Histogram of z scores for Triglyceride levels")
zz<- seq(-3, 3, length =101)
lines(zz, dnorm(zz))
The fit to the Normal curve is not perfect, but it is close enough that we can make distributional predictions about sample means of triglyceride levels for samples of 25 women. Below is the probability that the mean of a sample of 25 falls below 100 mg/dL, given the mean of the sampling distribution.
pnorm(100, mean(pop), SE)
## [1] 0.1061973
There is a 10.6% chance that a sample mean will be less than 100 mg/dL if the true mean is 117 mg/dL.
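The same probability can be reached by standardizing first and then using the standard Normal curve. A small check, with the population mean and standard error copied from the output above rather than recomputed from the data:

```r
mu <- 116.9451   # mean(pop), from the output above
se <- 13.58864   # SE, from the output above

z_100 <- (100 - mu) / se        # how many standard errors 100 lies below the mean
p_direct <- pnorm(100, mu, se)  # probability on the original scale
p_via_z  <- pnorm(z_100)        # same probability on the z scale
c(z = z_100, direct = p_direct, via_z = p_via_z)
```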
qnorm(0.9, mean(pop), SE)
## [1] 134.3597
A sample mean triglyceride level of 134.36 mg/dL represents the cut-off for the top 10% of sample means.
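qnorm() is the inverse of pnorm(), so the cut-off can also be written as the mean plus a z-value times the standard error. A brief sketch using the values printed above:

```r
mu <- 116.9451   # mean(pop), from the output above
se <- 13.58864   # SE, from the output above

z_90 <- qnorm(0.9)                   # standard Normal 90th percentile
cutoff <- mu + z_90 * se             # same value as qnorm(0.9, mu, se)
round_trip <- pnorm(cutoff, mu, se)  # feeds the cut-off back through pnorm()
c(z_90 = z_90, cutoff = cutoff, round_trip = round_trip)
```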
IQR <- qnorm(0.75, mean(pop), SE) - qnorm(0.25, mean(pop), SE)
print(IQR)
## [1] 18.3308
IQRpop <-qnorm(0.75, mean(pop), SD) - qnorm(0.25, mean(pop), SD)
print(IQRpop)
## [1] 91.65401
The middle 50% of sample means of triglyceride levels vary by only 18.33 mg/dL, while the population's middle 50% varies by 91.65 mg/dL under the same Normal model. This is consistent with the Central Limit Theorem: sample means become more Normal and less variable as the sample size increases.
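Under a Normal model the interquartile range is a fixed multiple of the standard deviation (2 × qnorm(0.75), roughly 1.35), so the two ranges above differ by exactly the factor sqrt(25) = 5. A short check with the printed standard error (variable names are illustrative):

```r
n  <- 25
se <- 13.58864          # SE, from the output above
sd_pop <- se * sqrt(n)  # recovers sd(pop), about 68

iqr_means <- 2 * qnorm(0.75) * se      # IQR of the sampling distribution
iqr_pop   <- 2 * qnorm(0.75) * sd_pop  # IQR of a Normal model for individuals
c(iqr_means = iqr_means, iqr_pop = iqr_pop, ratio = iqr_pop / iqr_means)
```

The 91.65 mg/dL figure comes from the Normal model. The quartiles in summary(pop) give an observed range of 147 - 68 = 79 mg/dL, a reminder that the population itself is skewed rather than Normal.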
pnorm(140, mean(pop), SE, lower.tail= FALSE)
## [1] 0.04488361
pnorm(140, mean (pop), SD, lower.tail=FALSE)
## [1] 0.3671823
It would be unusual to see a sample mean greater than 140 mg/dL; we would expect to see such a mean only 4.5% of the time. However, seeing an individual above 140 mg/dL is much more likely: under the Normal model, we would see this result 36.7% of the time.
Null Hypothesis: mean = 117. The average triglyceride level is 117 mg/dL.
Alternative Hypothesis: mean < 117. The average triglyceride level is less than 117 mg/dL.
pnorm(97, mean(pop), SE)
## [1] 0.07108196
Though the difference is considerable, we would expect to see a sample mean of 97 mg/dL or lower 7.1% of the time if the null hypothesis were true. This is higher than the standard significance level of 5%, so we RETAIN THE NULL. There is not enough evidence that the experimental drug lowers triglyceride levels in women.
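The decision above can be wrapped in a small helper. This is only a sketch of a one-sided z-test with a known standard error; the function name lower_tail_z_test and its arguments are hypothetical, and the 0.05 default is the significance level quoted in the text.

```r
# Hypothetical helper: one-sided (lower-tail) z-test for a sample mean
lower_tail_z_test <- function(xbar, mu0, se, alpha = 0.05) {
  z <- (xbar - mu0) / se
  p <- pnorm(z)  # P(sample mean <= xbar) if the null hypothesis is true
  list(z = z, p_value = p, reject_null = p < alpha)
}

# Values taken from the output above
res <- lower_tail_z_test(xbar = 97, mu0 = 116.9451, se = 13.58864)
res
```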