The Data

download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")

Look at the first few rows

head(bdims)
##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1

Create 2 new data sets for male and female dimensions

mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)

Exercise 1

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

par(mfrow = c(1,2))
# meanfhgt <- mean(fdims$hgt)
hist(mdims$hgt,xlim = c(140,200), ylim = c(0,80))
hist(fdims$hgt, xlim = c(140,200), ylim = c(0,80))

# abline(v=meanfhgt, col = "blue", lwd = 2)

The normal distribution

Plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)
library(tidyverse)
## -- Attaching packages -------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## -- Conflicts ----------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
new <- mutate(fdims, z = (hgt - fhgtmean) / fhgtsd)
hist(new$z, xlim = c(-4, 4))

hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06), col = "green")
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
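As a quick check on the standardization step above, z-scores should have mean 0 and standard deviation 1. The sketch below uses a simulated stand-in vector (hgt_demo, a hypothetical name, not the real fdims$hgt) so that it runs on its own; the mean of 165 and SD of 6.5 are arbitrary illustration values, not estimates from the data.

```r
# Hypothetical stand-in for fdims$hgt so the example is self-contained.
set.seed(1)
hgt_demo <- rnorm(260, mean = 165, sd = 6.5)

# Standardize by hand: subtract the mean, then divide by the standard deviation.
z_manual <- (hgt_demo - mean(hgt_demo)) / sd(hgt_demo)

# Base R's scale() performs the same centering and scaling.
z_scale <- as.numeric(scale(hgt_demo))

mean(z_manual)                # essentially 0, up to floating-point error
sd(z_manual)                  # essentially 1
all.equal(z_manual, z_scale)  # the two approaches agree
```

Because standardizing only shifts and rescales the values, the histogram of z has the same shape as the histogram of the raw heights.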

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

Answer: Yes. The histogram is roughly symmetric and bell-shaped, and it tracks the overlaid normal density curve closely.

Evaluating the normal distribution

qqnorm(fdims$hgt)
qqline(fdims$hgt)
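To see what a normal probability plot actually draws, the coordinates can be rebuilt by hand: the sorted data go on the vertical axis and standard normal quantiles at evenly spaced probabilities go on the horizontal axis. This is a minimal sketch on a simulated stand-in vector (hgt_demo is hypothetical, as are its mean and SD), not on the real heights.

```r
# Hypothetical stand-in for fdims$hgt so the example is self-contained.
set.seed(2)
hgt_demo <- rnorm(260, mean = 165, sd = 6.5)

# Theoretical quantiles: standard normal quantiles at evenly spaced probabilities.
theoretical <- qnorm(ppoints(length(hgt_demo)))

# Sample quantiles: the data sorted from smallest to largest.
sample_q <- sort(hgt_demo)

plot(theoretical, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# qqnorm() computes the same coordinates; plot.it = FALSE returns them without drawing.
qq <- qqnorm(hgt_demo, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), sample_q)
```

If the data are close to normal, these points fall near a straight line, which is what qqline() adds as a visual reference.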

Simulate data from a normal distribution to see how it would look on a QQ Plot

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

qqnorm(sim_norm)
qqline(sim_norm)

qqnormsim(fdims$hgt)

Exercise 4

Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?

Answer: Although some of the plots deviate slightly at the upper and lower ends, they appear nearly normal for the most part.
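Note that qqnormsim() is not a base R function; this report assumes it is already available in the workspace. For readers without it, the idea can be approximated in base R: draw the QQ plot for the data, then draw QQ plots for several samples simulated from a normal distribution with the same mean and SD. The function name qq_sim_grid and the stand-in data below are hypothetical, and this is only a rough sketch of the technique, not the lab's own implementation.

```r
# Hypothetical helper: QQ plot of the data next to QQ plots of simulated normal samples.
qq_sim_grid <- function(x, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))  # restore the plotting layout afterwards

  qqnorm(x, main = "Data")
  qqline(x)

  # Each simulated sample matches the data in size, mean, and SD.
  sims <- replicate(n_sim,
                    rnorm(length(x), mean = mean(x), sd = sd(x)),
                    simplify = FALSE)
  for (i in seq_along(sims)) {
    qqnorm(sims[[i]], main = paste("Simulation", i))
    qqline(sims[[i]])
  }
  invisible(sims)
}

# Hypothetical stand-in for fdims$hgt so the example is self-contained.
set.seed(3)
hgt_demo <- rnorm(260, mean = 165, sd = 6.5)
sims <- qq_sim_grid(hgt_demo)
```

The simulated panels show how much wobble to expect from truly normal data of this sample size, which gives a baseline for judging the real plot.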

Exercise 5

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

fwgtmean <- mean(fdims$wgt)
fwgtsd   <- sd(fdims$wgt)
sim_norm_wgt <- rnorm(n = length(fdims$wgt), mean = fwgtmean, sd = fwgtsd)
qqnormsim(fdims$wgt)

Answer: Female weights look less normal than female heights. The points in the upper right of the plot bend upward away from the line, which means the largest weights are larger than a normal distribution would predict. Most of the remaining points lie close to the QQ line, so the distribution is approximately normal apart from that upper tail.

Normal probabilities

Calculate the probability that a randomly chosen woman is taller than 182 cm (about 6 feet), assuming female heights are normally distributed

1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387

Assuming a normal distribution has allowed us to calculate a theoretical probability. To calculate the probability empirically, we count how many observations fall above 182 cm, then divide that count by the total sample size.

sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
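The two calculations above can be wrapped in a small helper so the theoretical and empirical answers sit side by side. The helper name tail_prob, the stand-in data, and the cutoff of 175 are all hypothetical choices for illustration. The sketch also shows where the z-score fits in: the theoretical probability is the upper-tail area of the standard normal beyond z = (cutoff - mean) / sd.

```r
# Hypothetical helper: upper-tail probability computed two ways.
tail_prob <- function(x, cutoff) {
  c(theoretical = 1 - pnorm(cutoff, mean = mean(x), sd = sd(x)),
    empirical   = mean(x > cutoff))  # share of observations above the cutoff
}

# Hypothetical stand-in for fdims$hgt so the example is self-contained.
set.seed(4)
hgt_demo <- rnorm(260, mean = 165, sd = 6.5)

p <- tail_prob(hgt_demo, cutoff = 175)
p

# Same theoretical probability via the z-score of the cutoff.
z <- (175 - mean(hgt_demo)) / sd(hgt_demo)
pnorm(z, lower.tail = FALSE)
```

mean(x > cutoff) is equivalent to sum(x > cutoff) / length(x), because the logical values are treated as 0 and 1.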

Exercise 6

Write out two probability questions that you would like to answer: one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution and the empirical distribution (four probabilities in all). Which variable, height or weight, had closer agreement between the two methods?

Answer: These questions will vary from student to student.

On Your Own

Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

  1. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.

  2. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.

  3. The histogram for general age (age) belongs to normal probability plot letter D.

  4. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.

Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

hist(fdims$kne.di)

qqnorm(fdims$kne.di)
qqline(fdims$kne.di)

Answer: The distribution of kne.di is right skewed. The QQ plot curves upward in the upper right, meaning the largest values are much larger than a normal distribution would predict. The points in the lower left also sit above the line, meaning the smallest values are not as small as expected. A short left tail combined with a long right tail is the signature of right skew, and the histogram is consistent with this.
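The QQ-plot signature of right skew can be checked on data that are known to be right skewed. The sketch below uses a simulated exponential sample (skewed_demo, hypothetical, unrelated to kne.di) and rebuilds the reference line that qqline() draws by default, which passes through the first and third quartiles. Both the smallest and the largest points end up above that line.

```r
# Hypothetical right-skewed sample, used only to illustrate the pattern.
set.seed(5)
skewed_demo <- rexp(500, rate = 1)

qq <- qqnorm(skewed_demo, main = "Right-skewed sample")
qqline(skewed_demo)

# Rebuild qqline()'s default reference line through the quartiles.
probs     <- c(0.25, 0.75)
slope     <- unname(diff(quantile(skewed_demo, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(skewed_demo, 0.25)) - slope * qnorm(0.25)
line_at   <- function(t) intercept + slope * t

# Compare the most extreme points with the line at the same theoretical quantile.
lowest  <- which.min(qq$y)
highest <- which.max(qq$y)
qq$y[lowest]  > line_at(qq$x[lowest])   # smallest value sits above the line
qq$y[highest] > line_at(qq$x[highest])  # largest value sits above the line
```

A left-skewed sample would show the mirror image, with both ends of the plot falling below the line.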