Lab 3: The Normal Distribution

Body Dimensions: This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.

load("more/bdims.RData")

Let’s take a quick peek at the first few rows of the data.

head(bdims)

##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1

mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

library(ggplot2)

# Men's heights
qplot(mdims$hgt, binwidth = 5)

# Women's heights
qplot(fdims$hgt, binwidth = 5)

Both appear relatively normal and bell-shaped.

The normal distribution

To check how accurate that description is, we can plot a normal distribution curve on top of a histogram and see how closely the data follow it. The curve should have the same mean and standard deviation as the data.

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)

Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve.

hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")

Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes, the data appear approximately normal.

Evaluating the normal distribution

An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.

qqnorm(fdims$hgt)
qqline(fdims$hgt)

A data set that is nearly normal will result in a probability plot where the points closely follow the line. Any deviations from normality lead to deviations of these points from the line.
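To make the construction of the plot concrete, here is a minimal sketch of how a normal probability plot can be built by hand. It uses a simulated stand-in sample (the mean of 165 and standard deviation of 6.5 are illustrative assumptions, not values from bdims) and checks the result against qqnorm.

```r
# Simulated stand-in for a height-like variable (hypothetical values,
# not the bdims data).
set.seed(1)
x <- rnorm(n = 260, mean = 165, sd = 6.5)

# Theoretical quantiles: standard normal quantiles at evenly spaced
# probabilities. ppoints() is what qqnorm() uses to choose them.
theoretical <- qnorm(ppoints(length(x)))

# Sample quantiles: the data sorted from smallest to largest.
sample_q <- sort(x)

# Plotting one against the other reproduces the normal probability plot.
plot(theoretical, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# Cross-check against qqnorm() without drawing a second plot.
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), sample_q)
```

If the data are nearly normal, the sorted values grow at the same rate as the theoretical quantiles, which is why the points fall close to a straight line.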

What do probability plots look like for data that we know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

qqnorm(sim_norm)
qqline(sim_norm)

Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function.

qqnormsim(fdims$hgt)
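The source of qqnormsim is not shown in this lab. The following is only a rough sketch of the idea, under the assumption that it draws several simulated normal samples with the same size, mean, and standard deviation as the data and plots them next to the real data. The helper name qq_sim_grid and the stand-in data are hypothetical.

```r
# Hypothetical helper: NOT the lab's qqnormsim(), just a sketch of the idea.
qq_sim_grid <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3))   # 3 x 3 grid: 1 real + 8 simulated
  on.exit(par(old_par))             # restore the plotting layout afterwards

  qqnorm(dat, main = "Data")
  qqline(dat)

  # Each simulated sample matches the data's size, mean, and sd.
  sims <- replicate(n_sim,
                    rnorm(n = length(dat), mean = mean(dat), sd = sd(dat)),
                    simplify = FALSE)
  for (i in seq_along(sims)) {
    qqnorm(sims[[i]], main = paste("Simulation", i))
    qqline(sims[[i]])
  }
  invisible(sims)
}

set.seed(42)
stand_in <- rnorm(260, mean = 165, sd = 6.5)  # hypothetical stand-in data
sims <- qq_sim_grid(stand_in)
```

Seeing several simulated plots side by side gives a sense of how much wobble around the line is ordinary for truly normal data of this sample size.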

Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?

While there is some more variation on the ends of female height data, in general it appears as though female heights are nearly normal.

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

qqnorm(fdims$wgt)
qqline(fdims$wgt)

The distribution of weight, however, does not appear to be as close to normal.

Normal probabilities

Once we decide that a random variable is approximately normal, we can answer all sorts of questions about that variable related to probability.

“What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?” If we assume that female heights are normally distributed, we can find this probability by calculating a Z score and consulting a Z table. In R, this is done in one step with the function pnorm.

1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)

## [1] 0.004434387

Note that the function pnorm gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
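The link between the Z-score approach and pnorm can be sketched as follows. The mean and standard deviation here are illustrative placeholders (in the lab they would be fhgtmean and fhgtsd), and the variable names are assumptions made for this example.

```r
# Illustrative values only; in the lab these come from fhgtmean and fhgtsd.
mu    <- 165   # hypothetical mean height in cm
sigma <- 6.5   # hypothetical standard deviation in cm
q     <- 182

# Step 1: the Z score is how many standard deviations q lies above the mean.
z <- (q - mu) / sigma

# Step 2: the upper-tail area of the standard normal curve beyond z.
p_from_z <- 1 - pnorm(z)

# pnorm() can do both steps at once, and lower.tail = FALSE returns the
# upper tail directly instead of subtracting from one.
p_direct <- pnorm(q, mean = mu, sd = sigma, lower.tail = FALSE)

all.equal(p_from_z, p_direct)
```

Both routes give the same probability; lower.tail = FALSE is simply a way to avoid writing the subtraction by hand.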

Assuming a normal distribution has allowed us to calculate a theoretical probability. To calculate the probability empirically, we simply count how many observations fall above 182 and divide that number by the total sample size.

sum(fdims$hgt > 182) / length(fdims$hgt)

## [1] 0.003846154
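With 260 women in the sample, the value 0.003846154 equals 1/260, so exactly one observation lies above 182 cm. The counting step can also be written with mean, as in this small sketch on simulated stand-in data (the heights vector is hypothetical, not the bdims data).

```r
# Hypothetical stand-in for 260 heights in cm (not the bdims data).
set.seed(7)
heights <- rnorm(260, mean = 165, sd = 6.5)

# The comparison yields a logical vector; TRUE counts as 1 and FALSE as 0.
above <- heights > 182

count_way <- sum(above) / length(heights)
mean_way  <- mean(above)   # the mean of 0/1 values is the proportion

all.equal(count_way, mean_way)
```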

Write out two probability questions that you would like to answer: one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution and the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

What is the probability that a randomly chosen young adult female is shorter than 5’3’’ (~160 cm)?

pnorm(q = 160, mean = fhgtmean, sd = fhgtsd)

## [1] 0.2282939

sum(fdims$hgt < 160) / length(fdims$hgt)

## [1] 0.1923077

What is the probability that a randomly chosen young adult female is lighter than 54 kg?

fwgtmean <- mean(fdims$wgt)
fwgtsd   <- sd(fdims$wgt)
pnorm(q = 54, mean = fwgtmean, sd = fwgtsd)

## [1] 0.2462249

sum(fdims$wgt < 54) / length(fdims$wgt)

## [1] 0.2192308

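For these two questions, the weight probabilities agree slightly more closely (0.246 vs. 0.219, a gap of about 0.027) than the height probabilities (0.228 vs. 0.192, a gap of about 0.036), even though the normal probability plots suggest that height is the more nearly normal variable. For the earlier height question (taller than 182 cm), the two methods agree closely: 0.0044 vs. 0.0038.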

On Your Own

Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter ____.

Plot B

b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter ____.

Plot C

c. The histogram for general age (age) belongs to normal probability plot letter ____.

Plot D

d. The histogram for female chest depth (che.de) belongs to normal probability plot letter ____.

Plot A
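The standardization mentioned in the exercise (subtract the mean, then divide by the standard deviation) can be sketched in a few lines. The example uses the first six che.de values printed by head(bdims) above, purely as a small illustration; the variable names are assumptions.

```r
# First six che.de values from the head(bdims) output shown earlier.
che_de <- c(17.7, 16.9, 20.9, 18.4, 21.5, 19.6)

# Standardize: subtract the mean, then divide by the standard deviation.
z <- (che_de - mean(che_de)) / sd(che_de)

# scale() performs the same calculation; it returns a one-column matrix,
# so as.numeric() turns it back into a plain vector.
z_scale <- as.numeric(scale(che_de))

mean(z)  # zero, up to floating-point rounding
sd(z)    # one
```

Standardizing changes only the units on the axes, not the shape of the histogram or of the normal probability plot, which is why the shapes can still be matched.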

Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

Most likely because the values are recorded at a limited precision, so many observations share exactly the same value. Age is given in whole years, which makes the jumps especially obvious. Elbow diameters are recorded to one decimal place (for example, 13.1 and 14.0 in the first rows above) and span a narrow range, so many people end up with the same recorded diameter.
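The effect of rounding on a normal probability plot can be illustrated with simulated data. This is a sketch under stated assumptions: the ages are generated, not taken from bdims, and the mean of 30 and standard deviation of 9 are arbitrary.

```r
set.seed(3)
# Hypothetical continuous ages (not the bdims data).
age_exact   <- rnorm(260, mean = 30, sd = 9)
age_rounded <- round(age_exact)   # recorded in whole years

# Rounding collapses many distinct values into a few repeated ones.
length(unique(age_exact))
length(unique(age_rounded))

# Tied values share one height on the vertical axis, producing flat steps.
par(mfrow = c(1, 2))
qqnorm(age_exact, main = "Exact values")
qqnorm(age_rounded, main = "Rounded values")
par(mfrow = c(1, 1))
```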

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

From the probability plot, it appears that there are fewer values as the quantiles increase. Therefore, the distribution appears to be right skewed.

qqnorm(fdims$kne.di)
qqline(fdims$kne.di)

hist(fdims$kne.di)
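For comparison, here is a sketch of what a clearly right-skewed sample looks like in both plots. The data are simulated from an exponential distribution, which is an assumption chosen only because it is strongly right skewed; it is not a model for knee diameter.

```r
set.seed(11)
# Hypothetical right-skewed sample (exponential), not the bdims data.
skewed <- rexp(260, rate = 1)

qqnorm(skewed)
qqline(skewed)   # the points bend upward away from the line at the right end
hist(skewed)     # the long tail extends to the right

# In a right-skewed sample the long tail pulls the mean above the median.
mean(skewed) > median(skewed)
```

A left-skewed sample would show the mirror image: points falling below the line at the left end and a histogram with its long tail on the left.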

[Figure: histQQmatch, the histograms and normal probability plots A through D used in the matching exercise]

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.