In this lab we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.
load("more/bdims.RData")
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 1
## 2 17.0 23 71.8 175.3 1
## 3 16.9 28 80.7 193.5 1
## 4 16.6 23 72.6 186.5 1
## 5 18.0 22 78.8 187.2 1
## 6 16.9 21 74.8 181.5 1
Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.
Variables to consider:
- weight in kg (wgt)
- height in cm (hgt)
- sex (1 indicates male, 0 indicates female)
mdims <- subset(bdims, sex == 1)
head(mdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 1
## 2 17.0 23 71.8 175.3 1
## 3 16.9 28 80.7 193.5 1
## 4 16.6 23 72.6 186.5 1
## 5 18.0 22 78.8 187.2 1
## 6 16.9 21 74.8 181.5 1
fdims <- subset(bdims, sex == 0)
head(fdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 248 37.6 25.0 31.3 16.2 24.9 11.2 9.2 17.0 12.3 95.0
## 249 36.7 26.4 31.0 16.8 24.5 12.1 9.9 19.3 12.8 99.5
## 250 34.8 25.9 30.2 16.4 24.2 11.3 8.9 17.0 12.2 88.0
## 251 36.6 27.9 31.8 19.3 24.9 12.3 9.5 18.6 13.0 97.0
## 252 35.5 28.2 31.0 18.2 26.2 11.5 9.1 17.2 12.4 103.3
## 253 37.0 28.0 32.0 15.1 25.7 12.5 10.0 17.2 13.2 93.5
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 248 83.0 66.5 79.0 92.0 53.5 24.3 20.5 32.0 32.2 21.0
## 249 78.5 61.5 70.5 90.5 57.7 27.8 24.0 38.5 38.5 22.5
## 250 75.0 61.2 66.5 91.0 53.0 24.0 22.0 32.5 32.5 19.0
## 251 86.5 78.0 91.0 99.5 61.5 28.0 24.0 35.2 36.7 23.0
## 252 91.0 70.5 80.5 91.5 55.0 26.9 22.7 33.0 33.3 19.9
## 253 79.5 66.5 78.5 94.0 54.0 26.5 22.5 34.0 35.0 23.0
## wri.gi age wgt hgt sex
## 248 13.5 22 51.6 161.2 0
## 249 15.0 20 59.0 167.5 0
## 250 14.0 19 49.2 159.5 0
## 251 15.0 25 63.0 157.0 0
## 252 14.5 21 53.6 155.8 0
## 253 14.5 23 59.0 170.0 0
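A quick sanity check after splitting a data frame is to confirm that every row landed in exactly one subset. The sketch below uses a toy data frame, not the real bdims data. The name `toy` and the simulated heights are assumptions made for illustration, and only the group sizes (247 and 260) come from the description above.

```r
# Toy stand-in for bdims: 247 rows coded 1 (male) and 260 rows coded 0 (female).
# The height values are simulated placeholders, not real measurements.
set.seed(1)
toy <- data.frame(
  sex = c(rep(1, 247), rep(0, 260)),
  hgt = c(rnorm(247, mean = 178, sd = 7), rnorm(260, mean = 165, sd = 6.5))
)

toy_m <- subset(toy, sex == 1)
toy_f <- subset(toy, sex == 0)

# The two subsets together should account for every row of the original
nrow(toy_m) + nrow(toy_f) == nrow(toy)

# table() gives the same group counts without creating new objects
table(toy$sex)
```

The same check on the real objects would be `nrow(mdims) + nrow(fdims) == nrow(bdims)`.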
hist(mdims$hgt)
hist(fdims$hgt)
The histogram of men’s heights is unimodal and nearly symmetric, while the female height distribution is unimodal but somewhat left skewed.
We can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data. We’ll be working with women’s heights, so let’s store them as a separate object and then calculate some statistics that will be referenced later.
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve.
hist(fdims$hgt, probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
The data appear to follow a nearly normal distribution, though not exactly, since the distribution is slightly left skewed.
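As an alternative to building the x and y vectors by hand, base R's `curve()` can draw the density directly over an existing histogram. This is a minimal sketch on simulated heights. The object `sim_hgt` and its mean and standard deviation are illustrative assumptions, not values taken from the data.

```r
# Simulated stand-in for a height variable (illustrative values only)
set.seed(1)
sim_hgt <- rnorm(260, mean = 165, sd = 6.5)

# Density histogram as the backdrop
hist(sim_hgt, probability = TRUE)

# curve() evaluates the expression in x over the current plot range;
# add = TRUE draws on top of the histogram instead of starting a new plot
overlay <- curve(dnorm(x, mean = mean(sim_hgt), sd = sd(sim_hgt)),
                 add = TRUE, col = "blue")
```

Because `curve()` picks its own evaluation grid, the overlaid line stays smooth even when the variable's range is narrow.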
An alternative approach to determining if the data appear to be nearly normally distributed involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.
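To make concrete what a normal Q-Q plot displays, here is a minimal sketch that builds one by hand: the sorted sample values are plotted against the quantiles a standard normal distribution would produce at the same ranks. The data are simulated, and the names `sample_data`, `theoretical`, and `observed` are chosen for illustration.

```r
# Simulated sample (illustrative mean and sd)
set.seed(42)
sample_data <- rnorm(50, mean = 165, sd = 6.5)
n <- length(sample_data)

# Theoretical quantiles: standard normal quantiles at evenly spread probabilities
theoretical <- qnorm(ppoints(n))

# Sample quantiles: the data sorted from smallest to largest
observed <- sort(sample_data)

plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# qqnorm() computes the same coordinates; plot.it = FALSE returns them without drawing
qq <- qqnorm(sample_data, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
```

If the data are nearly normal, these points fall close to a straight line, which is what `qqline` helps to judge.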
qqnorm(fdims$hgt)
qqline(fdims$hgt)
What do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
summary(sim_norm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 143.7 160.3 165.2 165.0 169.4 185.8
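Because rnorm draws random numbers, the summary above will differ from run to run. A common way to make a simulation repeatable is to fix the random seed first. The sketch below uses an arbitrary seed and illustrative parameter values, neither of which comes from the lab.

```r
# Setting the same seed before each call reproduces the same draws
set.seed(2024)
draw_1 <- rnorm(n = 5, mean = 165, sd = 6.5)

set.seed(2024)
draw_2 <- rnorm(n = 5, mean = 165, sd = 6.5)

# TRUE: both vectors contain exactly the same values
identical(draw_1, draw_2)
```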
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
qqnorm(sim_norm)
qqline(sim_norm)
Yes, the simulated points fall close to a straight line, whereas the points for the female heights zigzag across the line and veer off at the ends.
Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function.
qqnormsim(fdims$hgt)
Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?
The female height plot looks very similar to the plots of the simulated data, which suggests that female height is very close to normally distributed.
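qqnormsim is not a base R function, and it appears to be supplied with the lab materials. The sketch below shows one way such a helper could work, assuming it draws the data's Q-Q plot next to several plots of simulated normal samples with the same mean and standard deviation. The name `qqnormsim_sketch` and the 3-by-3 layout are hypothetical and may not match the lab's actual implementation.

```r
# Hypothetical re-implementation: one Q-Q plot of the data plus n_sim
# Q-Q plots of simulated normal samples with matching size, mean, and sd
qqnormsim_sketch <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))  # restore the plotting layout when done

  qqnorm(dat, main = "Data")
  qqline(dat)

  # Each column of sims is one simulated sample
  sims <- replicate(n_sim, rnorm(length(dat), mean = mean(dat), sd = sd(dat)))
  for (i in seq_len(n_sim)) {
    qqnorm(sims[, i], main = paste("Simulation", i))
    qqline(sims[, i])
  }
  invisible(sims)
}

# Usage on simulated stand-in data (illustrative values)
set.seed(7)
example_data <- rnorm(260, mean = 165, sd = 6.5)
sims <- qqnormsim_sketch(example_data)
```

Comparing against several simulated panels gives a sense of how much wobble around the line is normal for a sample of this size.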
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
hist(fdims$wgt, probability = TRUE)
x <- 40:110
y <- dnorm(x = x, mean = fwgtmean, sd = fwgtsd)
lines(x = x, y = y, col = "blue")
qqnorm(fdims$wgt)
qqline(fdims$wgt)
Female weight is not nearly as closely approximated by the normal distribution: the normal probability plot is more curved than straight, and the histogram is right skewed.
“What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?”
If we assume that female heights are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm.
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
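The one-step pnorm call is equivalent to the two-step Z score approach described above. The sketch below checks this with illustrative values for the mean and standard deviation. The names `m` and `s` and their values are assumptions, not the statistics computed from the data.

```r
m <- 165       # illustrative mean height in cm (assumed value)
s <- 6.5       # illustrative standard deviation in cm (assumed value)
cutoff <- 182  # height of interest in cm

# Step 1: convert the cutoff to a Z score
z <- (cutoff - m) / s

# Step 2: upper-tail probability from the standard normal distribution
p_from_z <- 1 - pnorm(z)

# One-step version, as used in the lab
p_direct <- 1 - pnorm(q = cutoff, mean = m, sd = s)

# lower.tail = FALSE returns the upper tail without the subtraction
p_upper <- pnorm(q = cutoff, mean = m, sd = s, lower.tail = FALSE)

all.equal(p_from_z, p_direct)
all.equal(p_direct, p_upper)
```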
Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182, then divide this number by the total sample size.
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
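The empirical proportion is the mean of a logical vector, so `mean()` gives the same answer as dividing the count by the sample size. With 260 observations, each one contributes 1/260, or about 0.0038, so the empirical result above corresponds to a single observation above 182 cm. The sketch below uses simulated heights, and `sim_hgt` and its parameters are illustrative assumptions.

```r
# Simulated stand-in for the female heights (illustrative values)
set.seed(11)
sim_hgt <- rnorm(260, mean = 165, sd = 6.5)

# Count divided by sample size, as in the lab
prop_sum <- sum(sim_hgt > 182) / length(sim_hgt)

# Equivalent shortcut: TRUE counts as 1 and FALSE as 0
prop_mean <- mean(sim_hgt > 182)

all.equal(prop_sum, prop_mean)

# Smallest nonzero proportion a sample of this size can produce
1 / length(sim_hgt)
```

This coarseness is one reason small tail probabilities estimated empirically can differ noticeably from the theoretical value.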
“What is the probability that a randomly chosen young adult female is shorter than 5 feet (about 152 cm)?”
pnorm(q = 152, mean = fhgtmean, sd = fhgtsd)
## [1] 0.02459975
sum(fdims$hgt < 152) / length(fdims$hgt)
## [1] 0.01923077
“What is the probability that a randomly chosen young adult female weighs more than 150 lbs (about 68 kg)?”
1 - pnorm(q = 68, mean = fwgtmean, sd = fwgtsd)
## [1] 0.2207879
sum(fdims$wgt > 68) / length(fdims$wgt)
## [1] 0.1923077
Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter D.
b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter A.
c. The histogram for general age (age) belongs to normal probability plot letter B.
d. The histogram for female chest depth (che.de) belongs to normal probability plot letter C.
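The standardization mentioned above (subtract the mean, then divide by the standard deviation) can be sketched as follows. The vector `raw` is simulated and its parameters are hypothetical. The point is only that the standardized values have mean 0 and standard deviation 1 regardless of the original units.

```r
# Hypothetical measurements in arbitrary units
set.seed(3)
raw <- rnorm(100, mean = 18, sd = 1.2)

# Manual standardization
z_manual <- (raw - mean(raw)) / sd(raw)

# scale() does the same; as.numeric() drops the matrix attributes it adds
z_scale <- as.numeric(scale(raw))

all.equal(z_manual, z_scale)

# After standardizing, the mean is (numerically) 0 and the sd is 1
c(mean = mean(z_manual), sd = sd(z_manual))
```

Standardizing changes the axis labels but not the shape of a histogram or normal probability plot, which is why shape is the only clue available for the matching.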
Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?
This is likely because these variables are recorded in discrete steps rather than on a truly continuous scale, so many observations share the same value.
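One way to explore the stepwise pattern is to round simulated normal data and compare the two normal probability plots. This is a sketch under assumed values: `smooth`, `rounded`, and the chosen mean and standard deviation are illustrative and unrelated to the actual variables.

```r
# Continuous simulated data versus the same data rounded to whole numbers
set.seed(5)
smooth <- rnorm(260, mean = 30, sd = 9)
rounded <- round(smooth)

# Rounding collapses many observations onto the same value
length(unique(smooth))
length(unique(rounded))

# Side-by-side Q-Q plots: tied values show up as horizontal steps
old_par <- par(mfrow = c(1, 2))
qqnorm(smooth, main = "Unrounded")
qqline(smooth)
qqnorm(rounded, main = "Rounded")
qqline(rounded)
par(old_par)
```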
[Figure: histQQmatch]
Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
kne.mean <- mean(fdims$kne.di)
kne.sd <- sd(fdims$kne.di)
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)
hist(fdims$kne.di, probability = TRUE)
x <- seq(from = 1, to = 40, by = 0.1)  # finer grid so the overlaid curve is smooth
y <- dnorm(x = x, mean = kne.mean, sd = kne.sd)
lines(x = x, y = y, col = "blue")
We can see that the distribution of female knee diameter is right skewed.
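To see what right skew looks like on a normal probability plot in a case where the answer is known, one can simulate from a right-skewed distribution. The sketch below uses an exponential distribution purely as an illustrative choice, and `right_skewed` is not related to the knee diameter data.

```r
# Exponential data are right skewed by construction
set.seed(9)
right_skewed <- rexp(260, rate = 1)

# The points bend upward away from the line, especially in the upper tail
qqnorm(right_skewed)
qqline(right_skewed)

# The histogram confirms the long right tail
hist(right_skewed, probability = TRUE)

# In a right-skewed sample the mean is typically pulled above the median
mean(right_skewed) > median(right_skewed)
```

A left-skewed sample shows the mirror image, with the points bending downward in the lower tail.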