The Data

download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
head(bdims)
##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1

The data set contains body measurements from 247 healthy men and 260 healthy women.

mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)

Two additional data sets were created, one for men and one for women, because the two groups have different body dimensions.

Exercise 1

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
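One way to approach this exercise is to draw the two histograms side by side with shared bins, so that center, spread, and shape can be compared directly. The sketch below runs on simulated stand-in heights: the vectors m_hgt and f_hgt and the means and standard deviations used to generate them are assumptions for illustration, not values from bdims. With the real data, mdims$hgt and fdims$hgt would take their place.

```r
# Stand-in data so the sketch runs on its own; NOT the bdims values.
set.seed(1)
m_hgt <- rnorm(247, mean = 178, sd = 7)    # assumed parameters
f_hgt <- rnorm(260, mean = 165, sd = 6.5)  # assumed parameters

# Shared 2 cm bins covering both samples make the panels comparable.
rng    <- range(c(m_hgt, f_hgt))
breaks <- seq(floor(rng[1]), ceiling(rng[2]) + 2, by = 2)

# Draw the two histograms next to each other, then restore the layout.
old_par <- par(mfrow = c(1, 2))
h_m <- hist(m_hgt, breaks = breaks, main = "Men's heights", xlab = "Height (cm)")
h_f <- hist(f_hgt, breaks = breaks, main = "Women's heights", xlab = "Height (cm)")
par(old_par)
```

Using the same breaks in both calls is a design choice: with the default binning, each histogram picks its own bin width, which can make the spreads look more alike or more different than they are.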

The Normal Distribution

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)

The mean and standard deviation are the only two parameters needed to specify a normal distribution. Here they are computed for the women’s heights only.
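To illustrate why those two numbers are enough, the sketch below writes the normal density out from its formula and compares the result with dnorm(). The values of mu and sigma are assumed for illustration and are not the values computed from fdims.

```r
# Assumed illustrative parameters, not the values computed from fdims.
mu    <- 165
sigma <- 6.5
x     <- seq(140, 190, by = 0.5)

# Normal density written out from its formula; only mu and sigma appear.
manual  <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
builtin <- dnorm(x, mean = mu, sd = sigma)

# TRUE when the two agree up to floating-point tolerance.
all.equal(manual, builtin)
```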

hist(fdims$hgt, probability = TRUE, col = "gray", ylim = c(0, 0.06), xlim = c(140, 190))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")

A density histogram was drawn as the backdrop, and lines() was used to overlay the normal probability curve. The x- and y-axis limits were set explicitly with xlim and ylim.

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

Evaluating The Normal Distribution

qqnorm(fdims$hgt)
qqline(fdims$hgt)

The Q-Q (quantile-quantile) plot, also called a normal probability plot, shows how close the data are to being normally distributed: the closer the points fall to the line, the closer the data are to normal. Deviations from normality show up as deviations from the line.
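To make the construction concrete, the sketch below builds the coordinates of a normal Q-Q plot by hand and checks them against what qqnorm() returns. The vector hgt is simulated stand-in data (an assumption for illustration), not fdims$hgt.

```r
# Stand-in data so the sketch runs on its own; NOT the bdims values.
set.seed(2)
hgt <- rnorm(260, mean = 165, sd = 6.5)  # assumed parameters

# Theoretical quantiles: standard normal quantiles at the plotting
# positions given by ppoints(). Sample quantiles: the sorted data.
n           <- length(hgt)
theoretical <- qnorm(ppoints(n))
sample_q    <- sort(hgt)

plot(theoretical, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# qqnorm() returns the same coordinates (in the original data order).
built <- qqnorm(hgt, plot.it = FALSE)
all.equal(sort(built$x), theoretical)
```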

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

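This simulates a data set from a normal distribution with the same size, mean, and standard deviation as the female heights. Its normal probability plot shows how far the points stray from the line through sampling variability alone, which gives a baseline for judging the plot of the real data.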

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

qqnormsim(fdims$hgt)

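qqnormsim() draws the normal probability plot of the original data alongside the plots of several data sets simulated from a normal distribution, so the real data can be compared with what truly normal data look like.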
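The idea behind this kind of comparison can be sketched in a few lines. The function qq_sim_grid() below is a hypothetical stand-in written for illustration; it is not the lab’s qqnormsim() and may differ from it in layout and details. The vector hgt is simulated stand-in data, not fdims$hgt.

```r
# Hypothetical helper (NOT the lab's qqnormsim): Q-Q plot of the data
# followed by Q-Q plots of n_sim normal samples with matching size,
# mean, and standard deviation. The 3 x 3 grid fits the default n_sim = 8.
qq_sim_grid <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))
  qqnorm(dat, main = "Observed data")
  qqline(dat)
  sims <- replicate(n_sim,
                    rnorm(length(dat), mean = mean(dat), sd = sd(dat)),
                    simplify = FALSE)
  for (s in sims) {
    qqnorm(s, main = "Simulated")
    qqline(s)
  }
  invisible(sims)  # return the simulated samples for inspection
}

# Stand-in data so the sketch runs on its own; NOT the bdims values.
set.seed(3)
hgt  <- rnorm(260, mean = 165, sd = 6.5)  # assumed parameters
sims <- qq_sim_grid(hgt)
```

If the panel for the observed data cannot be picked out from the simulated panels, the data are consistent with a normal distribution.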

Exercise 4

Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?

Yes, the plot for the original data looks similar to the plots for the simulated data. This is evidence that the female heights are nearly normal.

Exercise 5

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

qqnormsim(fdims$wgt)

Normal Probabilities

What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?

1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387

pnorm() gives the area under the normal curve below a given value (to its left), so the area above that value is 1 minus the result.
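The upper-tail area can also be requested directly through the lower.tail argument of pnorm(). The sketch below shows that the two approaches agree; mu and sigma are assumed values for illustration, not the mean and standard deviation computed from fdims.

```r
# Assumed illustrative parameters, not the values computed from fdims.
mu    <- 165
sigma <- 6.5

# Area above 182: by complement, and directly with lower.tail = FALSE.
upper_by_complement <- 1 - pnorm(182, mean = mu, sd = sigma)
upper_direct        <- pnorm(182, mean = mu, sd = sigma, lower.tail = FALSE)

# TRUE when the two agree up to floating-point tolerance.
all.equal(upper_by_complement, upper_direct)
```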

sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154

This computes the probability empirically, as the proportion of women in the sample who are taller than 182 cm.
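Because a logical comparison yields TRUE/FALSE values that R treats as 1 and 0, the same proportion can be written with mean(). The sketch below checks this on simulated stand-in heights (the vector hgt is an assumption for illustration, not fdims$hgt).

```r
# Stand-in data so the sketch runs on its own; NOT the bdims values.
set.seed(4)
hgt <- rnorm(260, mean = 165, sd = 6.5)  # assumed parameters

# Proportion of values above 182, written two equivalent ways.
p_sum  <- sum(hgt > 182) / length(hgt)
p_mean <- mean(hgt > 182)

all.equal(p_sum, p_mean)
```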

Exercise 6

Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

1.a. What’s the probability that a randomly chosen young male adult weighs more than 100 kg?

mwgtmean <- mean(mdims$wgt)
mwgtsd   <- sd(mdims$wgt)
1 - pnorm(q = 100, mean = mwgtmean, sd = mwgtsd)
## [1] 0.01881231

b. What’s the empirical probability?

sum(mdims$wgt > 100) / length(mdims$wgt)
## [1] 0.02834008

2.a. What’s the probability that a randomly chosen young female adult weighs more than 80 kg?

fwgtmean <- mean(fdims$wgt)
fwgtsd   <- sd(fdims$wgt)
1 - pnorm(q = 80, mean = fwgtmean, sd = fwgtsd)
## [1] 0.02182199

b. What’s the empirical probability?

sum(fdims$wgt > 80) / length(fdims$wgt)
## [1] 0.04230769
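To compare how closely the two methods agree, the gap between the theoretical and empirical probability can be computed for each question. The sketch below copies the four probabilities from the output printed above; the object names are chosen here for illustration.

```r
# Probabilities copied from the output printed above.
male_wgt   <- c(theoretical = 0.01881231, empirical = 0.02834008)
female_wgt <- c(theoretical = 0.02182199, empirical = 0.04230769)

# Absolute gap between the two methods for each question.
gap <- c(
  male_wgt_over_100  = unname(abs(diff(male_wgt))),
  female_wgt_over_80 = unname(abs(diff(female_wgt)))
)
gap
```

For these two questions the gap is about 0.0095 for male weights over 100 kg and about 0.0205 for female weights over 80 kg, so the two methods agree more closely for the male weights.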

On My Own

  1. Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help.
     a. The histogram for female biiliac diameter belongs to normal probability plot letter B.
     b. The histogram for female elbow diameter belongs to normal probability plot letter C.
     c. The histogram for general age belongs to normal probability plot letter D.
     d. The histogram for female chest depth belongs to normal probability plot letter A.
  2. Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

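I think this is the case because the underlying values are discrete or recorded at limited precision (age, for example, is recorded in whole years), so many observations share exactly the same value. Tied values line up as flat runs of points, which produces the steps. Skewness, by contrast, would show up as curvature away from the line rather than as steps.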
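One way to check an explanation for the stepwise pattern is to see what rounding does to a normal probability plot. The sketch below compares simulated continuous values with the same values rounded to whole numbers; the data and parameters are assumptions for illustration and are not taken from bdims.

```r
# Stand-in data so the sketch runs on its own; NOT the bdims values.
set.seed(5)
raw     <- rnorm(260, mean = 30, sd = 9)  # assumed parameters
rounded <- round(raw)                     # e.g. recorded in whole units

# Side-by-side Q-Q plots: ties in the rounded values form flat runs.
old_par <- par(mfrow = c(1, 2))
qqnorm(raw, main = "Continuous values");  qqline(raw)
qqnorm(rounded, main = "Rounded values"); qqline(rounded)
par(old_par)

# Rounding leaves far fewer distinct values, hence the steps.
c(unique_raw = length(unique(raw)), unique_rounded = length(unique(rounded)))
```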

  3. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

qqnorm(fdims$kne.di)  # normal probability plot requested by the exercise
qqline(fdims$kne.di)
fkneemean <- mean(fdims$kne.di)
fkneesd   <- sd(fdims$kne.di)
hist(fdims$kne.di, probability = TRUE, col = "yellow", ylim = c(0, 0.4), xlim = c(14, 26))
curve(dnorm(x, mean = fkneemean, sd = fkneesd), add = TRUE, col = "blue")  # normal curve for comparison
