The Normal Distribution

download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")

Exercise 1

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

# In bdims, sex is coded 1 for men and 0 for women.
men   <- subset(bdims, sex == 1)
women <- subset(bdims, sex == 0)

hist(men$hgt, breaks = 20)

hist(women$hgt, breaks = 20)

Both distributions are similar in shape: each is unimodal and roughly symmetric, which is consistent with a normal distribution. The women’s mean height is smaller than the men’s.
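The visual comparison can be backed up with a few summary statistics. The sketch below is only an illustration: `men_hgt` and `women_hgt` are simulated stand-ins with made-up means and standard deviations, and `describe` is a hypothetical helper, not part of the lab. With the real data, `men$hgt` and `women$hgt` would take their place.

```r
# Simulated stand-ins for men$hgt and women$hgt (NOT the bdims data;
# the means and standard deviations are illustrative assumptions).
set.seed(1)
men_hgt   <- rnorm(250, mean = 178, sd = 7)
women_hgt <- rnorm(260, mean = 165, sd = 6.5)

# Hypothetical helper: centre and spread of one numeric vector.
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

# One column per group, one row per statistic.
comparison <- round(cbind(men = describe(men_hgt),
                          women = describe(women_hgt)), 2)
print(comparison)
```

A mean close to the median in each column is what a roughly symmetric distribution would produce, and the gap between the two `mean` entries quantifies the shift between the groups.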

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

wmean <- mean(women$hgt)
wsd   <- sd(women$hgt)
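# probability = TRUE puts the histogram on the density scale,
# so the normal curve can be drawn on the same axes.
hist(women$hgt, probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = wmean, sd = wsd)
lines(x = x, y = y, col = "blue")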

Based on this plot alone, the data do appear to follow a nearly normal distribution. A histogram cannot confirm this on its own, however; a normal probability (QQ) plot would make it easier to judge.
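A minimal sketch of such a QQ plot follows. It uses a simulated stand-in `hgt` (an assumption, with made-up parameters) instead of `women$hgt`, and adds a rough numeric companion to the picture: the correlation between theoretical and sample quantiles, which is close to 1 when the points lie near a straight line.

```r
# Simulated stand-in for women$hgt (NOT the bdims data).
set.seed(2)
hgt <- rnorm(260, mean = 165, sd = 6.5)

# qqnorm() draws the plot and invisibly returns the plotted coordinates:
# x = theoretical normal quantiles, y = ordered sample values.
qq <- qqnorm(hgt, main = "Normal Q-Q plot (simulated stand-in)")
qqline(hgt, col = "blue")

# Correlation of the plotted points; values near 1 mean a nearly straight line.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

The correlation is only an informal summary; the shape of any departure from the line (curvature, steps, stray tails) still has to be read off the plot.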

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

sim_norm <- rnorm(n = length(women$hgt), mean = wmean, sd = wsd)
qqnorm(sim_norm)
qqline(sim_norm)

qqnormsim(women$hgt)

Not all of the points fall on the line, but the majority do. The ones that don’t are at the ends, where some deviation is expected even for data that really are normal.
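To make the idea behind this comparison concrete, here is a rough sketch of a function that draws the QQ plot of the data next to several QQ plots of simulated normal samples with the same mean and standard deviation. `qq_sim_grid` is a hypothetical re-implementation written for illustration; the lab's own `qqnormsim` may differ in layout and details.

```r
# Hypothetical helper (NOT the lab's qqnormsim): QQ plot of the data
# followed by n_sim QQ plots of simulated normal samples.
qq_sim_grid <- function(x, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(3, 3, 2, 1))
  on.exit(par(old_par))  # restore the plotting layout afterwards

  qqnorm(x, main = "Data")
  qqline(x)

  for (i in seq_len(n_sim)) {
    # Same sample size, mean, and sd as the data, but truly normal.
    sim <- rnorm(length(x), mean = mean(x), sd = sd(x))
    qqnorm(sim, main = paste("Sim", i))
    qqline(sim)
  }
  invisible(NULL)
}

# Usage with a simulated stand-in for women$hgt.
set.seed(3)
qq_sim_grid(rnorm(260, mean = 165, sd = 6.5))
```

If the "Data" panel is hard to tell apart from the simulated panels, the deviations in the real data are no larger than what random sampling from a normal distribution produces.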

Exercise 4

Does the normal probability plot for fdims$hgt (women$hgt in this write-up) look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?

The plots for the simulated data look very similar to the normal probability plot of the original women’s height data, which supports the view that the female heights are nearly normal.

Exercise 5

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

qqnormsim(women$wgt)

The women’s weight data may not be normal: the plots show curvature rather than points that follow a straight line. This suggests the data could be skewed.
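A numeric check can complement the visual impression of skew. The sketch below computes a simple moment-based skewness estimate; `sample_skewness` is a hypothetical helper, and both vectors are simulated stand-ins (not the bdims weights) chosen to contrast a symmetric sample with a right-skewed one.

```r
# Hypothetical helper: average cubed z-score, a simple skewness estimate.
# Near 0 for symmetric data, positive for right skew, negative for left skew.
sample_skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Simulated stand-ins (NOT the bdims data).
set.seed(4)
symmetric    <- rnorm(260, mean = 60, sd = 9)
right_skewed <- 40 + rlnorm(260, meanlog = 3, sdlog = 0.5)

skew_sym   <- sample_skewness(symmetric)
skew_right <- sample_skewness(right_skewed)
print(c(symmetric = skew_sym, right_skewed = skew_right))
```

Applied to `women$wgt`, a clearly positive value would agree with a QQ plot that bends upward away from the line at the high end.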

Exercise 6

Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

What percent of females are shorter than 150 cm? What percent of females weigh more than 70 kg?

# Heights
pnorm(q = 150, mean = wmean, sd = wsd)

## [1] 0.01152955

sum(women$hgt < 150) / length(women$hgt)

## [1] 0.01153846

# Weights
wwmean <- mean(women$wgt)
wwsd   <- sd(women$wgt)
1 - pnorm(q = 70, mean = wwmean, sd = wwsd)

## [1] 0.1641539

sum(women$wgt > 70) / length(women$wgt)

## [1] 0.1576923

Height had the closer agreement between the two methods: the theoretical and empirical probabilities differ by about 0.00001 for height, compared with about 0.006 for weight.
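The two-method comparison can be wrapped in a small function so the same calculation is applied to both variables. This is a sketch under stated assumptions: `compare_prob` is a hypothetical helper, and `hgt` and `wgt` are simulated stand-ins rather than the bdims columns, so the printed values will not match the output above.

```r
# Hypothetical helper: theoretical (normal) vs empirical probability of
# falling below q (lower = TRUE) or above q (lower = FALSE).
compare_prob <- function(x, q, lower = TRUE) {
  theoretical <- pnorm(q, mean = mean(x), sd = sd(x), lower.tail = lower)
  empirical   <- if (lower) mean(x < q) else mean(x > q)
  c(theoretical = theoretical,
    empirical   = empirical,
    abs_diff    = abs(theoretical - empirical))
}

# Simulated stand-ins for women$hgt and women$wgt (NOT the bdims data).
set.seed(5)
hgt <- rnorm(260, mean = 165, sd = 6.5)
wgt <- 40 + rlnorm(260, meanlog = 3, sdlog = 0.5)

res_hgt <- compare_prob(hgt, q = 150)                # P(height < 150)
res_wgt <- compare_prob(wgt, q = 70, lower = FALSE)  # P(weight > 70)
print(rbind(height = res_hgt, weight = res_wgt))
```

The `abs_diff` column is the quantity to compare across variables: the smaller it is, the better the normal model reproduces the observed proportion at that cutoff.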

ON YOUR OWN

1. Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.

qqnorm(women$bii.di)
qqline(women$bii.di)

b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.

qqnorm(women$elb.di)
qqline(women$elb.di)

c. The histogram for general age (age) belongs to normal probability plot letter D.

qqnorm(bdims$age)
qqline(bdims$age)

d. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.

qqnorm(women$che.de)
qqline(women$che.de)
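The standardization described in question 1 (subtract the mean, then divide by the standard deviation) can be checked with a short example. The vector `x` below is made-up data used only for illustration; the point is that standardizing changes the units but not the shape, so a histogram or QQ plot keeps the same pattern.

```r
# Made-up measurements (NOT the bdims data).
set.seed(7)
x <- rnorm(260, mean = 27.6, sd = 2.3)

# Standardize by hand: subtract the mean, then divide by the sd.
z <- (x - mean(x)) / sd(x)

# scale() performs the same centring and scaling by default.
z_scale <- as.numeric(scale(x))

print(c(mean = mean(z), sd = sd(z)))  # approximately 0 and exactly-scaled 1
print(all.equal(z, z_scale))
```

Because the transformation is linear, plotting `qqnorm(z)` gives the same point pattern as `qqnorm(x)` with a relabelled vertical axis, which is why the units are of no help when matching the figures.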

2. Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

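Stepwise patterns appear when data are discrete or rounded, so that many observations share the same value. By the matches in question 1, plots C and D correspond to elbow diameter and age. Age was probably recorded in whole years, and elbow diameter was probably recorded with coarse precision relative to its narrow range, while pelvic diameter and chest depth take on more distinct values.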
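The effect of rounding on a normal probability plot can be reproduced with simulated values. In this sketch, `age_exact` is a hypothetical continuous age (an assumption for illustration, not the bdims `age` column) and `age_years` is the same variable truncated to whole years, which creates ties and therefore flat steps in the plot.

```r
# Hypothetical continuous ages (NOT the bdims data).
set.seed(6)
age_exact <- runif(260, min = 18, max = 67)

# Recording only whole years creates many tied values.
age_years <- floor(age_exact)

n_distinct_exact <- length(unique(age_exact))
n_distinct_years <- length(unique(age_years))
print(c(exact = n_distinct_exact, whole_years = n_distinct_years))

# Side-by-side QQ plots: the truncated version shows horizontal steps.
old_par <- par(mfrow = c(1, 2))
qqnorm(age_exact, main = "Unrounded")
qqline(age_exact)
qqnorm(age_years, main = "Whole years")
qqline(age_years)
par(old_par)
```

Each step in the right-hand panel is a group of observations sharing one recorded value; the fewer distinct values a variable has, the more pronounced the steps become.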

3. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

qqnorm(women$kne.di)
qqline(women$kne.di)

hist(women$kne.di, breaks = 20)

Based on the normal probability plot, the variable appears to be right skewed. The histogram confirms this, showing a longer tail on the right.

Georgia Galanopoulos
