606 Lab 3

The Data

This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.

load('/Users/EKandTower/Dropbox/cuny_msds/rwd/Lab3/more/bdims.RData')

Let’s take a quick peek at the first few rows of the data.

head(bdims)

##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1

You’ll see that for every observation we have 25 measurements, many of which are either diameters or girths.

Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.

mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
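
The same split can be tried on a toy data frame. This is only a minimal sketch: `toy` and its values are made up for illustration, and it assumes the coding used above (1 for men, 0 for women).

```r
# Hypothetical toy data; the lab itself uses bdims.
toy <- data.frame(
  hgt = c(174.0, 160.2, 181.5, 165.1, 158.8),
  sex = c(1, 0, 1, 0, 0)
)

# subset() keeps the rows where the condition is TRUE.
toy_m <- subset(toy, sex == 1)
toy_f <- subset(toy, sex == 0)

# Logical indexing is an equivalent base R alternative.
toy_f2 <- toy[toy$sex == 0, ]

nrow(toy_m)  # 2
nrow(toy_f)  # 3
```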

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

hist(mdims$hgt, main = 'Histogram of Male Heights', xlab = 'Height (cm)')

hist(fdims$hgt, main = 'Histogram of Female Heights', xlab = 'Height (cm)')

Both histograms are roughly bell-shaped, the male one more clearly than the female one.

The normal distribution

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)

Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve.

hist(fdims$hgt, probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = 'blue')

The top of the curve is cut off because the limits of the x- and y-axes are set to best fit the histogram. To adjust the y-axis you can add a third argument to the histogram function: ylim = c(0, 0.06).
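
One way to avoid guessing the upper limit is to compute it from the data. The sketch below uses a simulated sample as a stand-in for fdims$hgt; the mean of 165 and standard deviation of 6.5 are assumed values chosen for illustration, not figures from the lab.

```r
set.seed(1)  # make the simulated sample reproducible
# Simulated stand-in for fdims$hgt; mean and sd are assumed values.
sim_hgt <- rnorm(260, mean = 165, sd = 6.5)

x <- seq(140, 190, by = 0.5)
y <- dnorm(x, mean = mean(sim_hgt), sd = sd(sim_hgt))

# Compute the bar heights without drawing, so that ylim can cover both
# the tallest bar and the peak of the curve.
h <- hist(sim_hgt, plot = FALSE)
top <- max(h$density, y)

hist(sim_hgt, probability = TRUE, ylim = c(0, top))
lines(x, y, col = 'blue')
```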

Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes

Evaluating the normal distribution

Eyeballing the shape of the histogram is one way to determine if the data appear to be nearly normally distributed, but it can be frustrating to decide just how close the histogram is to the curve. An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for ‘quantile-quantile’.

qqnorm(fdims$hgt)
qqline(fdims$hgt)

A data set that is nearly normal will result in a probability plot where the points closely follow the line. We’re left with the same problem that we encountered with the histogram above: how close is close enough?
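
To see what the plot is made of, the coordinates can be built by hand: the ordered sample values are plotted against the corresponding quantiles of a standard normal distribution. This is a sketch on simulated data (the mean and standard deviation are assumed values), using the base R functions ppoints and qnorm.

```r
set.seed(2)
sim_hgt <- rnorm(260, mean = 165, sd = 6.5)  # assumed values, for illustration

n <- length(sim_hgt)
theo <- qnorm(ppoints(n))   # theoretical standard normal quantiles
samp <- sort(sim_hgt)       # ordered sample values

# qqnorm() returns the same coordinates when asked not to plot.
qq <- qqnorm(sim_hgt, plot.it = FALSE)

plot(theo, samp, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles')
qqline(sim_hgt)
```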

A useful way to address this question is to rephrase it as: what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

qqnorm(sim_norm)
qqline(sim_norm)

Not all of the points fall on the line, but most of them are on or very close to it.

Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function. It may be helpful to click the zoom button in the plot window.

qqnormsim(fdims$hgt)
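
For readers without the lab's workspace, a helper along these lines could produce a similar grid. This is an assumption about how such a function might work, not the definition shipped with the lab; the name `qq_sim_grid` and its `n_sim` argument are hypothetical.

```r
# Hypothetical helper: one Q-Q plot of the data followed by n_sim Q-Q plots
# of samples simulated from a normal distribution with the same mean and sd.
qq_sim_grid <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3))  # 3 x 3 grid of panels
  on.exit(par(old_par))            # restore the layout afterwards

  qqnorm(dat, main = 'Normal Q-Q Plot (Data)')
  qqline(dat)

  for (i in seq_len(n_sim)) {
    sim <- rnorm(length(dat), mean = mean(dat), sd = sd(dat))
    qqnorm(sim, main = 'Normal Q-Q Plot (Sim)')
    qqline(sim)
  }
  invisible(NULL)
}

set.seed(3)
qq_sim_grid(rnorm(260, mean = 165, sd = 6.5))  # simulated stand-in data
```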

Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?

Yes, these plots all appear to be similarly close to the line in most cases.

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

qqnormsim(fdims$wgt)

Yes, these also look like they follow a normal distribution.

Normal probabilities

If we assume that female heights are normally distributed (a very close approximation is also okay), we can ask, for example, what the probability is that a randomly chosen female is taller than 182 cm. We can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm.

1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)

## [1] 0.004434387
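
The Z score route and the one-step pnorm call give the same answer, which can be checked directly. The mean and standard deviation below are assumed round values for illustration; the lab computes them from fdims$hgt.

```r
# Assumed values for illustration only.
m <- 165
s <- 6.5
q <- 182

z <- (q - m) / s                  # Z score of 182 cm
p_table  <- 1 - pnorm(z)          # what a Z table lookup would give
p_direct <- 1 - pnorm(q, mean = m, sd = s)
p_upper  <- pnorm(q, mean = m, sd = s, lower.tail = FALSE)  # same upper tail

all.equal(p_table, p_direct)  # TRUE
```

The lower.tail = FALSE form avoids the explicit subtraction from 1.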

Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.

sum(fdims$hgt > 182) / length(fdims$hgt)

## [1] 0.003846154

Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.
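
The empirical proportion can also be written as the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. A minimal sketch on simulated data follows; the sample parameters and the cutoff of 170 are assumptions chosen for illustration.

```r
set.seed(4)
sim_hgt <- rnorm(260, mean = 165, sd = 6.5)  # simulated stand-in, assumed values

cutoff <- 170
emp1 <- sum(sim_hgt > cutoff) / length(sim_hgt)
emp2 <- mean(sim_hgt > cutoff)   # same proportion, shorter to write

theo <- 1 - pnorm(cutoff, mean = mean(sim_hgt), sd = sd(sim_hgt))
c(empirical = emp2, theoretical = theo)
```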

Write out two probability questions that you would like to answer: one regarding female heights and one regarding female weights.

Calculate those probabilities using both the theoretical normal distribution and the empirical distribution (four probabilities in all).
Which variable, height or weight, had a closer agreement between the two methods?

What is the probability that a random female participant is below the national US average weight (76.4 kg)?

fwgtmean <- mean(fdims$wgt)
fwgtsd   <- sd(fdims$wgt)

fw_normdis <- pnorm(q = 76.4, mean = fwgtmean, sd = fwgtsd)
np <- format(round(fw_normdis*100, 2), nsmall = 2)

fw_empdis <- sum(fdims$wgt < 76.4) / length(fdims$wgt)
ep <- format(round(fw_empdis*100, 2), nsmall = 2)

# Absolute difference, in percentage points to match np and ep
wdiff <- format(round(abs(fw_normdis - fw_empdis) * 100, 2), nsmall = 2)

cat('Under the normal model there is a ', np, '% chance that a random female participant will be below the national US average weight, while the empirical estimate is ', ep, '%. These figures are very close, differing by only ', wdiff, ' percentage points.', sep = '')

## Under the normal model there is a 94.98% chance that a random female participant will be below the national US average weight, while the empirical estimate is 94.23%. These figures are very close, differing by only 0.75 percentage points.

What is the probability that a random female participant is above the national US average height (161.8 cm)?

fh_normdis <- 1-pnorm(q = 161.8, mean = fhgtmean, sd = fhgtsd)
np <- format(round(fh_normdis*100, 2), nsmall = 2)

fh_empdis <- sum(fdims$hgt > 161.8) / length(fdims$hgt)
ep <- format(round(fh_empdis*100, 2), nsmall = 2)

# Absolute difference, in percentage points to match np and ep
wdiff <- format(round(abs(fh_normdis - fh_empdis) * 100, 2), nsmall = 2)

cat('Under the normal model there is a ', np, '% chance that a random female participant will be above the national US average height, while the empirical estimate is ', ep, '%. These figures are close, differing by only ', wdiff, ' percentage points.', sep = '')

## Under the normal model there is a 68.06% chance that a random female participant will be above the national US average height, while the empirical estimate is 65.77%. These figures are close, differing by only 2.29 percentage points.

The theoretical and empirical probabilities agreed more closely for weight (a difference of about 0.008) than for height (about 0.023).

On Your Own

Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
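
Standardizing can be sketched in one line. The values below are the six che.de entries shown in the head(bdims) output earlier; the comparison with scale() is included only as a cross-check.

```r
x <- c(17.7, 16.9, 20.9, 18.4, 21.5, 19.6)  # che.de values from head(bdims)

# Subtract the mean, then divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

# Base R's scale() does the same by default; as.numeric() drops the matrix shape.
z_scale <- as.numeric(scale(x))

mean(z)  # effectively 0
sd(z)    # 1
```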

a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.

b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.

c. The histogram for general age (age) belongs to normal probability plot letter D.

d. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.

Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

For D (which I believe is age), the pattern makes sense because age is recorded only in whole numbers, while the other measurements are recorded to one decimal place and so produce a more continuous pattern.

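C (which I assume is elbow diameter) is a little less clear-cut. Looking at more detailed metrics, elb.di has a very small standard deviation (0.84) and IQR (1.1), as well as a fairly tight total range (5.1). With values recorded to one decimal place over such a narrow range, there are only about 50 possible distinct values, so many observations tie and the points line up in steps.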
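
The effect of rounding on a normal probability plot can be reproduced with simulated data. This is a sketch under assumed parameters: the mean and spread of `raw` and the mean of `narrow` are made up, while the standard deviation of 0.84 is the elb.di figure quoted above.

```r
set.seed(5)
raw <- rnorm(260, mean = 30, sd = 9)   # simulated continuous values, assumed parameters

whole  <- round(raw)       # recorded in whole numbers, like age
tenths <- round(raw, 1)    # recorded to one decimal place

# Narrow spread recorded to one decimal place, loosely like elb.di.
narrow <- round(rnorm(260, mean = 13, sd = 0.84), 1)

# Fewer distinct values means more ties, and ties show up as steps.
length(unique(whole))
length(unique(tenths))
length(unique(narrow))

par(mfrow = c(1, 3))
qqnorm(whole,  main = 'Whole numbers')
qqnorm(tenths, main = 'One decimal, wide spread')
qqnorm(narrow, main = 'One decimal, narrow spread')
par(mfrow = c(1, 1))
```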

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

I believe the shape of this probability plot suggests that the distribution is short-tailed but symmetric. I'm not convinced, though, so I'm going to run a few simulations to see if anything else emerges.

qqnormsim(fdims$kne.di)

A couple of the plots look right skewed (the first one very much so) but most support my original theory.

hist(fdims$kne.di)

The histogram supports my original thought: a very tight, fairly normal distribution. It also shows some right skew, which was evident in my other plots.
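
For comparison, here is what clear right skew looks like in both kinds of plot. The sketch uses a simulated log-normal sample, which is right skewed by construction; it is not the knee diameter data, and its parameters are assumptions.

```r
set.seed(6)
skewed <- rlnorm(500, meanlog = 0, sdlog = 1)  # right skewed by construction

# In standardized units the upper tail reaches much further from 0
# than the lower tail does.
z <- (skewed - mean(skewed)) / sd(skewed)
range(z)

par(mfrow = c(1, 2))
qqnorm(skewed)   # points bend upward, away from the line, on the right
qqline(skewed)
hist(skewed)     # long tail to the right
par(mfrow = c(1, 1))
```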

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.