The Normal Distribution
The Data
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi che.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2 89.5
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5 97.0
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1 97.5
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5 97.0
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5 97.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8 99.9
## wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi wri.gi age
## 1 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5 16.5 21
## 2 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5 17.0 23
## 3 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9 16.9 28
## 4 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0 16.6 23
## 5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4 18.0 22
## 6 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5 16.9 21
## wgt hgt sex
## 1 65.6 174.0 1
## 2 71.8 175.3 1
## 3 80.7 193.5 1
## 4 72.6 186.5 1
## 5 78.8 187.2 1
## 6 74.8 181.5 1
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
mean(mdims$hgt)
## [1] 177.7453
median(mdims$hgt)
## [1] 177.8
mean(fdims$hgt)
## [1] 164.8723
median(fdims$hgt)
## [1] 164.5
Exercise 1
Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
hist(mdims$hgt, main = "Histogram of Male Heights", xlab = "Height (cm)")
abline(v = mean(mdims$hgt), col = "royalblue", lwd = 2)
abline(v = median(mdims$hgt), col = "red", lwd = 2)
legend(x = "topright", # location of legend within plot area
       legend = c("Mean (177.7)", "Median (177.8)"),
       col = c("royalblue", "red"),
       lwd = c(2, 2))

hist(fdims$hgt, main = "Histogram of Female Heights", xlab = "Height (cm)")
abline(v = mean(fdims$hgt), col = "royalblue", lwd = 2)
abline(v = median(fdims$hgt), col = "red", lwd = 2)
legend(x = "topright", # location of legend within plot area
       legend = c("Mean (164.9)", "Median (164.5)"),
       col = c("royalblue", "red"),
       lwd = c(2, 2))

Answer: Both the male and female height distributions are roughly symmetric and approximately normal. The male mean (177.7 cm) and median (177.8 cm) are nearly identical, as are the female mean (164.9 cm) and median (164.5 cm). On average, men are therefore about 13 cm taller than women. The male mean sits just below the median, which hints at a very slight left skew, whereas the female mean sits above the median, which hints at a slight right skew.
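The skew directions noted in the answer can also be checked numerically. The following is a minimal sketch and not part of the original lab: skew_summary is a hypothetical helper, and the data are synthetic stand-ins rather than the bdims heights.

```r
# Hypothetical helper: report mean, median, and moment-based skewness.
skew_summary <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))            # SD with divisor n, as used in the moment formula
  c(mean = m,
    median = median(x),
    skewness = mean((x - m)^3) / s^3)   # > 0 suggests right skew, < 0 suggests left skew
}

set.seed(1)
right_skewed <- rexp(500)   # synthetic, strongly right-skewed sample
symmetric <- rnorm(500)     # synthetic, symmetric sample

skew_summary(right_skewed)  # mean above median, positive skewness
skew_summary(symmetric)     # mean close to median, skewness near zero
```

With the lab data loaded, skew_summary(mdims$hgt) and skew_summary(fdims$hgt) would give the corresponding figures for the two height distributions.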
The Normal Distribution
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
hist(fdims$hgt, main = "Histogram of Female Height (cm)", xlab = "Heights (cm)", probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")

Exercise 2
Based on this plot, does it appear that the data follow a nearly normal distribution?
Answer: Yes, based on this plot, the female height data does appear to follow a nearly normal distribution.
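The histogram above is drawn with probability = TRUE so that its bars share a scale with the dnorm() curve. A small sketch of why that matters, using synthetic data as a stand-in for fdims$hgt (the sample size, mean, and standard deviation below are illustrative assumptions):

```r
set.seed(42)
x <- rnorm(260, mean = 165, sd = 6.5)   # synthetic stand-in for female heights

h <- hist(x, plot = FALSE)              # compute the bins without drawing
bin_width <- diff(h$breaks)

# On the density scale the bar areas sum to 1, like the area under dnorm().
area_density <- sum(h$density * bin_width)

# On the count scale the total area is n times the bin width, so a dnorm()
# curve drawn on top would look almost flat by comparison.
area_counts <- sum(h$counts * bin_width)

c(density_scale = area_density, count_scale = area_counts)
```

Without probability = TRUE, the y-axis shows counts, and the normal curve would have to be multiplied by the sample size and the bin width to line up with the bars.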
Evaluating the Normal Distribution
qqnorm(fdims$hgt)
qqline(fdims$hgt)

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
Exercise 3
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
qqnorm(sim_norm)
qqline(sim_norm)

Answer: No, not all of the points fall on the line. The plot for the simulated data is nevertheless very similar to the plot for the real data. In both, the points follow the line closely for theoretical quantiles between -2 and 2 and drift away from it in the tails, roughly between -3 and -2 and between 2 and 3. The real female heights show slightly more of a sawtooth pattern in the central region than the simulated data, but the difference is minor.
qqnormsim(fdims$hgt)

Exercise 4
Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?
Answer: Yes, the plots provide evidence that the female heights are nearly normal. The normal probability plot for fdims$hgt looks much like the plots for the simulated data: in each one the points stay close to the line, with only minor deviations.
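The comparison used in this exercise can be reproduced with base R alone. The sketch below is a hypothetical stand-in for qqnormsim(), not its actual source: the function name qq_sim_grid, the 3 x 3 layout, and the default of eight simulations are assumptions made for illustration.

```r
# Hypothetical stand-in for the lab's qqnormsim(): one Q-Q plot of the observed
# data followed by several Q-Q plots of normal samples with matching mean and SD.
qq_sim_grid <- function(x, n_sim = 8) {
  x <- x[!is.na(x)]
  old_par <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
  on.exit(par(old_par))                 # restore the plotting layout afterwards

  qqnorm(x, main = "Observed data")
  qqline(x)

  for (i in seq_len(n_sim)) {
    sim <- rnorm(n = length(x), mean = mean(x), sd = sd(x))
    qqnorm(sim, main = paste("Simulation", i))
    qqline(sim)
  }
  invisible(NULL)
}

set.seed(5)
qq_sim_grid(rnorm(260, mean = 165, sd = 6.5))   # synthetic data, not fdims$hgt
```

If the observed panel cannot be told apart from the simulated panels, the data give no visible evidence against normality.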
Exercise 5
Using the same technique, determine whether or not female weights appear to come from a normal distribution.
qqnormsim(fdims$wgt)

Answer: Using the same technique, female weights do appear to come from a normal distribution. All of the simulated plots appear consistent with the normality line.
Normal Probabilities
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
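The two calculations above follow one pattern: a tail probability from the fitted normal curve, and the matching proportion in the sample. A minimal sketch of that pattern, where compare_upper_tail is a hypothetical helper and the data are synthetic rather than taken from bdims:

```r
# Hypothetical helper: upper-tail probability from a fitted normal model
# alongside the proportion of observations above the same cutoff.
compare_upper_tail <- function(x, cutoff) {
  x <- x[!is.na(x)]
  c(theoretical = pnorm(cutoff, mean = mean(x), sd = sd(x), lower.tail = FALSE),
    empirical = mean(x > cutoff))       # mean of a logical vector is a proportion
}

set.seed(7)
sim_heights <- rnorm(10000, mean = 165, sd = 6.5)   # synthetic heights in cm
compare_upper_tail(sim_heights, cutoff = 182)
```

Here pnorm(..., lower.tail = FALSE) is equivalent to 1 - pnorm(...), and mean(x > cutoff) is equivalent to sum(x > cutoff) / length(x). For a "less than" question, the lower tail is pnorm() with its default lower.tail = TRUE.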
Exercise 6
Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
Answer: Question 1: “What is the probability that a randomly chosen young adult female is shorter than 5 feet (about 152 cm)?”
pnorm(q = 152, mean = fhgtmean, sd = fhgtsd)
## [1] 0.0245998
sum(fdims$hgt < 152) / length(fdims$hgt)
## [1] 0.01923077
Because the question asks for a lower-tail probability, pnorm() is used directly rather than 1 - pnorm(). The normal model gives a probability of about 0.025, while the observed proportion in fdims is about 0.019 (5 of 260 women), a difference of roughly 0.005. A cutoff of 152 cm lies about two standard deviations below the mean, in the tail where the normal probability plots departed most from the line and where only a handful of observations determine the empirical proportion. Even so, the two values are close, which is consistent with heights being nearly normal.
Answer: Question 2: “What is the probability that a randomly chosen young adult female weighs more than 68 kg?”
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
hist(fdims$wgt, main = "Histogram of Female Weight (kg)", xlab = "Weight (kg)", probability = TRUE)
x <- 40:110
y <- dnorm(x = x, mean = fwgtmean, sd = fwgtsd)
lines(x = x, y = y, col = "blue")

1 - pnorm(q = 68, mean = fwgtmean, sd = fwgtsd)
## [1] 0.2207879
sum(fdims$wgt > 68) / length(fdims$wgt)
## [1] 0.1923077
Answer: The theoretical probability of about 0.22 is larger than the observed proportion of about 0.19 (50 of 260 women), a difference of roughly 0.03. That gap is several times larger than the gap of roughly 0.005 for heights under 152 cm, so height shows the closer agreement between the two methods. A weight of 68 kg lies less than one standard deviation above the mean (an upper-tail probability of 0.22 corresponds to a z-score of roughly 0.8), so the larger gap cannot be attributed to an extreme cutoff. It suggests instead that the normal model describes female weights somewhat less well than female heights.
On Your Own
1. Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.
b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.
c. The histogram for general age (age) belongs to normal probability plot letter D.
d. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.
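The standardization described in this exercise (subtract the mean, then divide by the standard deviation) can be sketched in a few lines. The data below are synthetic and only stand in for one of the diameter variables; the mean of 13 and standard deviation of 0.8 are arbitrary assumptions.

```r
set.seed(11)
x <- rnorm(100, mean = 13, sd = 0.8)   # synthetic stand-in for a diameter in cm

z <- (x - mean(x)) / sd(x)             # standardize: centre, then rescale

c(mean = mean(z), sd = sd(z))          # approximately 0 and exactly-scaled to 1
all.equal(z, as.numeric(scale(x)))     # scale() performs the same transformation
identical(order(x), order(z))          # ordering of the observations is unchanged
```

Because standardizing is a linear rescaling that preserves the order of the observations, it changes the axis labels but not the shape of a histogram or a normal probability plot, which is why the plots can still be matched.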
2. Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?
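Answer: Normal probability plots C (elbow diameter) and D (age) have a slight stepwise pattern because the recorded measurements take only a limited set of distinct values. Age is recorded in whole years and elbow diameter is recorded to a fixed precision (one decimal place in the listing above), so many observations share exactly the same value. Tied observations sit at the same height across a run of theoretical quantiles, which produces the steps.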
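The effect of recording precision on a normal probability plot can be illustrated with simulated data. This is a sketch under assumed values (the mean of 30 and standard deviation of 10 are arbitrary and are not estimates from bdims):

```r
set.seed(3)
continuous <- rnorm(250, mean = 30, sd = 10)   # synthetic continuous measurements
rounded <- round(continuous)                   # same values recorded as whole numbers

# Rounding collapses many observations onto the same value.
c(distinct_continuous = length(unique(continuous)),
  distinct_rounded = length(unique(rounded)))

# Side by side: the rounded data show horizontal runs (steps) of tied values.
old_par <- par(mfrow = c(1, 2))
qqnorm(continuous, main = "Continuous")
qqline(continuous)
qqnorm(rounded, main = "Rounded")
qqline(rounded)
par(old_par)
```

The coarser the recording precision relative to the spread of the data, the more pronounced the steps become.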
3. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)

The QQ plot shows the data to be skewed right. The histogram below confirms this.
hist(fdims$kne.di, main = "Histogram of Female Knee Diameters", xlab = "Female Knee Diameters")
