The normal distribution

In this lab we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.

The Data

This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.

load("more/bdims.RData")
head(bdims)
##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1

Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.

Variables to consider:

  • weight in kg (wgt)
  • height in cm (hgt)
  • sex (1 indicates male, 0 indicates female)

mdims <- subset(bdims, sex == 1)
head(mdims)
##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1
fdims <- subset(bdims, sex == 0)
head(fdims)
##     bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 248   37.6   25.0   31.3   16.2   24.9   11.2    9.2   17.0   12.3   95.0
## 249   36.7   26.4   31.0   16.8   24.5   12.1    9.9   19.3   12.8   99.5
## 250   34.8   25.9   30.2   16.4   24.2   11.3    8.9   17.0   12.2   88.0
## 251   36.6   27.9   31.8   19.3   24.9   12.3    9.5   18.6   13.0   97.0
## 252   35.5   28.2   31.0   18.2   26.2   11.5    9.1   17.2   12.4  103.3
## 253   37.0   28.0   32.0   15.1   25.7   12.5   10.0   17.2   13.2   93.5
##     che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 248   83.0   66.5   79.0   92.0   53.5   24.3   20.5   32.0   32.2   21.0
## 249   78.5   61.5   70.5   90.5   57.7   27.8   24.0   38.5   38.5   22.5
## 250   75.0   61.2   66.5   91.0   53.0   24.0   22.0   32.5   32.5   19.0
## 251   86.5   78.0   91.0   99.5   61.5   28.0   24.0   35.2   36.7   23.0
## 252   91.0   70.5   80.5   91.5   55.0   26.9   22.7   33.0   33.3   19.9
## 253   79.5   66.5   78.5   94.0   54.0   26.5   22.5   34.0   35.0   23.0
##     wri.gi age  wgt   hgt sex
## 248   13.5  22 51.6 161.2   0
## 249   15.0  20 59.0 167.5   0
## 250   14.0  19 49.2 159.5   0
## 251   15.0  25 63.0 157.0   0
## 252   14.5  21 53.6 155.8   0
## 253   14.5  23 59.0 170.0   0

1. Make a histogram of men’s heights and a histogram of women’s heights.

hist(mdims$hgt)

hist(fdims$hgt)

How would you compare the various aspects of the two distributions?

The histogram of men’s heights is unimodal and nearly symmetric while the female height distribution is unimodal but somewhat left skewed.

We can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data. We’ll be working with women’s heights, so let’s calculate their mean and standard deviation and store them for later reference.

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)

Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve.

hist(fdims$hgt, probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
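The probability = TRUE argument matters because dnorm returns densities, so the histogram has to be on the density scale for the curve to line up with the bars. A minimal sketch of that idea, using simulated values as a stand-in for fdims$hgt (the mean of 165 and sd of 6.5 are illustrative assumptions, not the lab's statistics):

```r
# Simulated stand-in for fdims$hgt (illustrative values, not the lab data).
set.seed(42)
heights <- rnorm(260, mean = 165, sd = 6.5)

# plot = FALSE returns the bin information without drawing anything.
h <- hist(heights, plot = FALSE)

# Each bar's area is density * bin width. On the density scale the
# areas sum to 1, just like the area under a normal curve.
total_area <- sum(h$density * diff(h$breaks))
total_area
```

With a frequency histogram the bar heights are counts, so a density curve drawn on the same axes would sit almost flat along the bottom.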

2. Based on this plot, does it appear that the data follow a nearly normal distribution?

The data appear to follow a nearly normal distribution, though not exactly: the distribution is slightly left skewed.

Evaluating the normal distribution

An alternative approach to determining if the data appear to be nearly normally distributed involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.

qqnorm(fdims$hgt)
qqline(fdims$hgt)
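To make the construction of a normal probability plot concrete, here is a rough sketch that builds the plotted coordinates by hand and checks them against qqnorm. The data are simulated as a stand-in for fdims$hgt, and the variable names are illustrative.

```r
# Simulated stand-in for fdims$hgt (illustrative values, not the lab data).
set.seed(42)
heights <- rnorm(260, mean = 165, sd = 6.5)

n <- length(heights)
theoretical <- qnorm(ppoints(n))  # standard normal quantiles at n evenly spread probabilities
observed    <- sort(heights)      # sample quantiles: the data in increasing order

# qqnorm() computes the same coordinates; plot.it = FALSE suppresses drawing.
qq <- qqnorm(heights, plot.it = FALSE)
same_x <- isTRUE(all.equal(sort(qq$x), theoretical))
same_y <- isTRUE(all.equal(sort(qq$y), observed))

# Plotting one set of quantiles against the other gives the Q-Q plot.
plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
```

If the data are nearly normal, the sorted observations grow roughly linearly with the theoretical quantiles, which is why the points fall near a straight line.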

What do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
summary(sim_norm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   143.7   160.3   165.2   165.0   169.4   185.8
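Because rnorm draws random values, the summary above will differ a little each time the code is run. A small sketch of how set.seed can make a simulation repeatable (the seed value and the mean and sd used here are arbitrary assumptions for illustration):

```r
# Setting the same seed before each call reproduces the same draws.
set.seed(2024)
first <- rnorm(n = 5, mean = 165, sd = 6.5)

set.seed(2024)
second <- rnorm(n = 5, mean = 165, sd = 6.5)

same_draws <- identical(first, second)
same_draws
```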

3. Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

qqnorm(sim_norm)
qqline(sim_norm)

The simulated points fall close to a straight line, although not every point lies exactly on it. The plot of the female heights, by contrast, zigzags across the line and veers away from it at the ends.

Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function.

qqnormsim(fdims$hgt)
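The internals of qqnormsim are not shown in this lab, so the following is only a sketch of the idea under stated assumptions: draw the Q-Q plot of the data next to several Q-Q plots of samples simulated from a normal distribution with the same size, mean, and standard deviation. The helper name qq_sim_grid and the simulated input are hypothetical.

```r
# Hypothetical helper; NOT the lab's qqnormsim, only an illustration of the idea.
qq_sim_grid <- function(dat, n_sim = 8) {
  # 3 x 3 grid: one panel for the data plus eight simulations (assumes n_sim = 8).
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))

  qqnorm(dat, main = "data")
  qqline(dat)

  # Each column is one simulated sample matching the data's size, mean and sd.
  sims <- replicate(n_sim, rnorm(n = length(dat), mean = mean(dat), sd = sd(dat)))
  for (i in seq_len(n_sim)) {
    qqnorm(sims[, i], main = paste("sim", i))
    qqline(sims[, i])
  }
  invisible(sims)
}

# Simulated stand-in for fdims$hgt (illustrative values, not the lab data).
set.seed(1)
demo_heights <- rnorm(260, mean = 165, sd = 6.5)
sims <- qq_sim_grid(demo_heights)
```

Seeing several simulated plots side by side gives a sense of how much wobble to expect from sampling variation alone, which is the yardstick for judging the real data.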

4. Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?

The female height plots do look very similar to the simulated data and show that female height is very close to being normally distributed.

5. Using the same technique, determine whether or not female weights appear to come from a normal distribution.

fwgtmean <- mean(fdims$wgt)
fwgtsd   <- sd(fdims$wgt)

hist(fdims$wgt, probability = TRUE)
x <- 40:110
y <- dnorm(x = x, mean = fwgtmean, sd = fwgtsd)
lines(x = x, y = y, col = "blue")

qqnorm(fdims$wgt)
qqline(fdims$wgt)

Female weight is not nearly as well approximated by the normal distribution: the normal probability plot is curved rather than straight, and the histogram is right skewed.

Normal probabilities

“What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?”

If we assume that female heights are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm.

1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
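The Z score route and the one-step pnorm call give the same answer. A brief sketch, using round illustrative values for the mean and standard deviation rather than the lab's fhgtmean and fhgtsd:

```r
# Illustrative parameters (assumed for this sketch, not computed from bdims).
m <- 165
s <- 6.5

# Step 1: standardize the cutoff to a Z score.
z <- (182 - m) / s

# Step 2: upper-tail area under the standard normal curve (the Z table lookup).
p_table <- 1 - pnorm(z)

# One-step version; lower.tail = FALSE asks directly for P(X > 182).
p_direct <- pnorm(q = 182, mean = m, sd = s, lower.tail = FALSE)

c(z = z, p_table = p_table, p_direct = p_direct)
```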

Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.

sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
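The empirical calculation is just a sample proportion, so it can also be written as the mean of a logical vector. A short sketch on simulated data (a stand-in for fdims$hgt; the values are illustrative assumptions):

```r
# Simulated stand-in for fdims$hgt (illustrative values, not the lab data).
set.seed(42)
heights <- rnorm(260, mean = 165, sd = 6.5)

# heights > 182 is a TRUE/FALSE vector; TRUE counts as 1 and FALSE as 0.
p_sum  <- sum(heights > 182) / length(heights)
p_mean <- mean(heights > 182)

c(p_sum = p_sum, p_mean = p_mean)
```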

6. Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

“What is the probability that a randomly chosen young adult female is shorter than 5 feet (about 152 cm)?”

pnorm(q = 152, mean = fhgtmean, sd = fhgtsd)
## [1] 0.02459975
sum(fdims$hgt < 152) / length(fdims$hgt)
## [1] 0.01923077

“What is the probability that a randomly chosen young adult female weighs more than 150 lbs (about 68 kg)?”

1 - pnorm(q = 68, mean = fwgtmean, sd = fwgtsd)
## [1] 0.2207879
sum(fdims$wgt > 68) / length(fdims$wgt)
## [1] 0.1923077

Height shows the closer agreement in absolute terms: the theoretical and empirical probabilities differ by about 0.005 for height versus about 0.028 for weight, which is consistent with weight being less well approximated by a normal distribution.

On Your Own

  • Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

    a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter D.

    b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter A.

    c. The histogram for general age (age) belongs to normal probability plot letter B.

    d. The histogram for female chest depth (che.de) belongs to normal probability plot letter C.

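  • Note that normal probability plots C and D have a slight stepwise pattern.
    Why do you think this is the case?

    This is likely because the variables are not recorded as truly continuous values: measurements are rounded to a fixed precision, so many observations share the same value and form flat steps on the plot.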

Figure: histQQmatch (histograms and the normal probability plots to match them to)

  • As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

kne.mean <- mean(fdims$kne.di)
kne.sd   <- sd(fdims$kne.di)

qqnorm(fdims$kne.di)
qqline(fdims$kne.di)

hist(fdims$kne.di, probability = TRUE)
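# Fine grid over the observed range so the overlaid curve is smooth
x <- seq(min(fdims$kne.di), max(fdims$kne.di), length.out = 200)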
y <- dnorm(x = x, mean = kne.mean, sd = kne.sd)
lines(x = x, y = y, col = "blue")

Both the normal probability plot and the histogram indicate that the distribution of female knee diameter is right skewed.
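
To see what right skew looks like in a normal probability plot, independent of the body dimensions data, the sketch below simulates a right-skewed sample from a log-normal distribution (an arbitrary choice for illustration) and draws both plots. For right-skewed data the points typically bend upward away from the line at the high end, and the mean is pulled above the median.

```r
# Simulated right-skewed sample (log-normal; parameters chosen only for illustration).
set.seed(42)
right_skewed <- rlnorm(500, meanlog = 0, sdlog = 0.6)

# Normal probability plot: look for an upward bend at the upper end.
qqnorm(right_skewed)
qqline(right_skewed)

# Histogram: look for the long tail stretching to the right.
hist(right_skewed, probability = TRUE)

# A quick numeric check: a long right tail pulls the mean above the median.
skew_check <- mean(right_skewed) > median(right_skewed)
skew_check
```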