The Data

download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)

Exercise 1

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

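# Women's heights in red; the axis limits leave room for the men's overlay.
hist(fdims$hgt, main="Height Histogram", xlab="Heights in cm",col=rgb(1,0,0,0.5),xlim=c(140,200),ylim=c(0,80))
# Men's heights in blue, drawn on top of the first histogram.
hist(mdims$hgt, col=rgb(0,0,1,0.5),add=TRUE)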
box()

From the overlaid histograms, males (blue) are generally taller: the male distribution sits to the right of the female distribution (red), so a larger share of men fall in the upper height bins and a larger share of women in the lower ones.
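
The visual comparison can be backed with numeric summaries of centre and spread. The sketch below is only an illustration: compare_groups is a hypothetical helper, and the two vectors are simulated stand-ins whose means and standard deviations are made-up values, not figures from bdims. In the lab, fdims$hgt and mdims$hgt would be passed instead.

```r
# Hypothetical helper: centre and spread for two numeric vectors.
compare_groups <- function(a, b) {
  stats <- function(x) c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
  rbind(a = stats(a), b = stats(b))
}

set.seed(1)
# Stand-in data with made-up parameters; replace with fdims$hgt and mdims$hgt.
women_demo <- rnorm(250, mean = 165, sd = 6.5)
men_demo   <- rnorm(250, mean = 178, sd = 7)

summary_table <- compare_groups(women_demo, men_demo)
round(summary_table, 1)
```

Comparing the mean and median rows speaks to centre and symmetry, while the sd and IQR columns speak to spread.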

The normal distribution

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)
hist(fdims$hgt, probability = TRUE, ylim = c(0,0.06))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

The histogram follows the overlaid normal density curve reasonably closely, so it is reasonable to treat female heights as nearly normal.

Evaluating the normal distribution

qqnorm(fdims$hgt)
qqline(fdims$hgt)

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

# Simulated normal sample in red.
hist(sim_norm, probability = TRUE, main="Normal vs Real Female Heights", xlab="Heights in cm",col=rgb(1,0,0,1),xlim=c(140,190),breaks=10)
# Real female heights in translucent green, overlaid on the simulated sample.
hist(fdims$hgt, col=rgb(0,1,0,0.5),add=TRUE,probability = TRUE,breaks=10)
box()

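Overlaying the simulated (red) and real (green) histograms shows where the two differ. Even for data simulated from a normal distribution, not every point of a normal probability plot falls exactly on the line; small departures, especially in the tails, are expected.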
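
The exercise itself asks for a normal probability plot of the simulated sample. A minimal sketch follows; sim_demo and its parameters are hypothetical stand-ins so that the block runs on its own, and in the lab sim_norm would take its place.

```r
set.seed(2)
# Stand-in for sim_norm, with made-up n, mean and sd.
sim_demo <- rnorm(n = 250, mean = 165, sd = 6.5)

# qqnorm() draws the plot and invisibly returns the plotted coordinates:
# x = theoretical normal quantiles, y = the sample values.
qq <- qqnorm(sim_demo, main = "Normal Q-Q Plot: simulated sample")
qqline(sim_demo)
```

Placing this plot next to qqnorm(fdims$hgt) gives a sense of how much scatter around the line is ordinary for a truly normal sample of this size.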

qqnormsim(fdims$hgt)

qqnorm(fdims$hgt); qqline(fdims$hgt)

Exercise 4

Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?

Using the QQ plot and the histogram overlay, we see that female heights are not exactly normal, but they follow the normal pattern closely, with some imbalance around the mode.
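
The comparison against simulated data rests on a simple idea: draw several normal samples with the data's own size, mean and standard deviation, and check whether the real plot looks out of place among them. The function below, qq_sim_grid, is a hypothetical re-creation of that idea for illustration only, not the lab's qqnormsim, and the demo vector is simulated with made-up parameters.

```r
# Hypothetical sketch: the data's Q-Q plot next to Q-Q plots of simulated normal samples.
qq_sim_grid <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(3, 3, 2, 1))
  on.exit(par(old_par))  # restore the previous graphics settings on exit

  qqnorm(dat, main = "Data")
  qqline(dat)
  for (i in seq_len(n_sim)) {
    sim <- rnorm(n = length(dat), mean = mean(dat), sd = sd(dat))
    qqnorm(sim, main = paste("Sim", i))
    qqline(sim)
  }
  invisible(n_sim + 1)  # number of panels drawn
}

set.seed(3)
heights_demo <- rnorm(250, mean = 165, sd = 6.5)  # stand-in for fdims$hgt
n_panels <- qq_sim_grid(heights_demo)
```

If the data panel cannot be picked out from the simulated panels, the plots give no strong evidence against near-normality.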

Exercise 5

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

fwgtmean <- mean(fdims$wgt)
fwgtsd   <- sd(fdims$wgt)
sim_normweight <- rnorm(n = length(fdims$wgt), mean = fwgtmean, sd = fwgtsd)
hist(sim_normweight, probability = TRUE, main="Normal vs Real Female Weight", xlab="Weight in kg",col=rgb(1,0,0,1),xlim=c(30,90),ylim=c(0,0.05),breaks=15)
hist(fdims$wgt, col=rgb(0,1,0,0.5),add=TRUE,probability = TRUE,breaks=15)
x <- 30:90
y <- dnorm(x = x, mean = fwgtmean, sd = fwgtsd)
lines(x = x, y = y, col = "blue")
box()

qqnorm(fdims$wgt); qqline(fdims$wgt)

Female weights show a similar “near-normal” pattern when compared with a simulated normal sample.

Normal probabilities

1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
sum(fdims$hgt > 182) / length(fdims$hgt)

Exercise 6

Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

Question 1: What is the probability that a female weighs more than 70 kg?

norm1<-1-pnorm(q = 70, mean = fwgtmean, sd = fwgtsd)
emp1<-sum(fdims$wgt > 70) / length(fdims$wgt)

Question 2: What is the probability that a female is between 150 and 160 cm tall?

norm2<-pnorm(160, mean = fhgtmean, sd = fhgtsd) - pnorm(150, mean = fhgtmean, sd = fhgtsd)
# Share of observed heights in (150, 160]; same value as the difference of the two tail proportions.
emp2<-mean(fdims$hgt > 150 & fdims$hgt <= 160)
norm1-emp1
## [1] 0.006461585
norm2-emp2
## [1] -0.04092797

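From these differences, weight (Question 1) shows the closer agreement between the two methods: its theoretical and empirical probabilities differ by about 0.006, against about 0.041 for height (Question 2). For these particular questions, height is the variable that looks less “normal”.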
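
The two methods used above can be wrapped in one small function so that any interval question is answered both ways at once. This is a sketch under stated assumptions: compare_prob is a hypothetical helper, and the demo vector is simulated with made-up parameters rather than taken from bdims.

```r
# Hypothetical helper: P(lower < X <= upper) from the fitted normal and from the data.
compare_prob <- function(x, lower = -Inf, upper = Inf) {
  m <- mean(x)
  s <- sd(x)
  theoretical <- pnorm(upper, mean = m, sd = s) - pnorm(lower, mean = m, sd = s)
  empirical   <- mean(x > lower & x <= upper)
  c(theoretical = theoretical, empirical = empirical,
    difference = theoretical - empirical)
}

set.seed(4)
x_demo <- rnorm(250, mean = 60, sd = 9)  # stand-in for fdims$wgt

upper_tail <- compare_prob(x_demo, lower = 70)              # P(X > 70)
interval   <- compare_prob(x_demo, lower = 50, upper = 60)  # P(50 < X <= 60)
everything <- compare_prob(x_demo)                          # whole real line
```

A difference close to zero means the normal model reproduces the observed proportion well for that question; the sign alone says only which method gave the larger value.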

On Your Own

1

Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

  1. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.

  2. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.

  3. The histogram for general age (age) belongs to normal probability plot letter D.

  4. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.
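
The standardization mentioned in the question (subtract the mean, then divide by the standard deviation) can be sketched as follows. The vector x_demo is simulated with made-up parameters purely for illustration.

```r
set.seed(5)
x_demo <- rnorm(100, mean = 50, sd = 10)    # made-up measurements

# Subtract the mean, then divide by the standard deviation.
z <- (x_demo - mean(x_demo)) / sd(x_demo)
c(mean = mean(z), sd = sd(z))               # approximately 0 and exactly-or-nearly 1

# scale() performs the same centring and scaling.
same_as_scale <- isTRUE(all.equal(z, as.numeric(scale(x_demo))))
```

Standardizing changes only the location and scale of the values, not the shape of their distribution, which is why the histograms can still be matched to their normal probability plots by shape.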

2

Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

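The underlying values are rounded to a limited set of discrete values, so many observations share exactly the same value. Tied values plot at the same height in a normal probability plot, which produces the steps; the binning of a histogram tends to hide this.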
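
The effect of rounding is easy to reproduce. The sketch below uses simulated data with made-up parameters: the same sample is plotted once as drawn and once rounded to whole numbers.

```r
set.seed(42)
raw     <- rnorm(200, mean = 30, sd = 8)  # continuous values
rounded <- round(raw)                     # the same values recorded as whole numbers

# Rounding collapses many observations onto the same value.
n_unique <- c(raw = length(unique(raw)), rounded = length(unique(rounded)))
n_unique

old_par <- par(mfrow = c(1, 2))
qqnorm(raw, main = "Unrounded"); qqline(raw)
qqnorm(rounded, main = "Rounded"); qqline(rounded)  # horizontal runs of tied values
par(old_par)
```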

3

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

qqnorm(fdims$kne.di); qqline(fdims$kne.di)

hist(fdims$kne.di, main="Female Knee Diameter", col=rgb(0,1,0,1),breaks=15, xlab="Diameter")

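The data are noticeably right-skewed: the histogram has a longer tail on the right, which matches the upper points of the normal probability plot rising above the line.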
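
A numeric check can complement the two plots. The sketch below defines a hypothetical helper, sample_skewness, as the third central moment divided by the 1.5 power of the second, and applies it to a simulated right-skewed sample; in the lab it could be applied to fdims$kne.di. Positive values point to right skew, negative values to left skew, and values near zero to symmetry.

```r
# Hypothetical helper: moment-based sample skewness.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(7)
right_skewed <- rexp(500)  # simulated stand-in with a long right tail

skew_value <- sample_skewness(right_skewed)
skew_value > 0             # TRUE for a right-skewed sample

# In the Q-Q plot, the upper-tail points of a right-skewed sample rise above the line.
qqnorm(right_skewed); qqline(right_skewed)
```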