The normal distribution

The Data

load("more/bdims.RData")

mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

hist(fdims$hgt, probability = TRUE)

hist(mdims$hgt, probability = TRUE)

qqnorm(fdims$hgt)
qqline(fdims$hgt)

qqnorm(mdims$hgt)
qqline(mdims$hgt)

Both distributions appear to be approximately normal, with a few outliers. This is clearly visible in the normal Q-Q plots. Based on the histograms, men’s heights seem more normal than women’s, but the Q-Q plots show that the two are very similar.

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)
mhgtmean <- mean(mdims$hgt)
mhgtsd   <- sd(mdims$hgt)

hist(fdims$hgt, probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")

hist(mdims$hgt, probability = TRUE)
x <- 150:210
y <- dnorm(x = x, mean = mhgtmean, sd = mhgtsd)
lines(x = x, y = y, col = "blue")

Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes, it does. The Q-Q plot supports this as well: the points closely follow the line, with a few outliers.
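A rough numerical complement to the visual check is the 68-95-99.7 rule: for nearly normal data, about 68%, 95%, and 99.7% of the observations should fall within 1, 2, and 3 standard deviations of the mean. The sketch below is only an illustration. It uses simulated heights as a stand-in for fdims$hgt (the mean of 165 and SD of 6.5 are assumed values, not taken from the data set), and within_k_sd is a hypothetical helper name.

```r
# Simulated stand-in for fdims$hgt; mean and sd are assumed, not from bdims.
set.seed(42)
hgt <- rnorm(n = 500, mean = 165, sd = 6.5)

# Proportion of observations within k sample SDs of the sample mean.
within_k_sd <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

observed <- sapply(1:3, function(k) within_k_sd(hgt, k))
expected <- pnorm(1:3) - pnorm(-(1:3))  # about 0.683, 0.954, 0.997

# Side-by-side comparison of observed and theoretical coverage.
round(rbind(observed, expected), 3)
```

To try this on the lab data, replace hgt with fdims$hgt. Large gaps between the two rows would be a hint that the normal model fits poorly.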

Evaluating the normal distribution

qqnorm(fdims$hgt)
qqline(fdims$hgt)

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

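Yes, nearly all of the points fall on the line. The simulated data use the same mean and SD as the real data, so the plot closely resembles the probability plot for the real data.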

hist(sim_norm, probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = mean(sim_norm), sd = sd(sim_norm))
lines(x = x, y = y, col = "blue")

qqnorm(sim_norm)
qqline(sim_norm)

qqnormsim(fdims$hgt)

Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?

Yes, they look very similar, with one or two outliers. The simulated samples are all drawn with the same mean and SD as the data.
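The idea behind this comparison can be sketched in a few lines of base R. This is not the lab's qqnormsim implementation, only an approximation of the technique: draw several normal samples with the observed sample's size, mean, and SD, then plot them next to the observed data. The function name qq_compare and the stand-in data (mean 165, SD 6.5) are assumptions made for the example.

```r
# Simulated stand-in for fdims$hgt; NOT the real bdims data.
set.seed(1)
hgt <- rnorm(n = 260, mean = 165, sd = 6.5)

# Avoid writing a plot file when this is run as a non-interactive script.
if (!interactive()) pdf(NULL)

# Plot the observed Q-Q plot followed by Q-Q plots of n_sim samples drawn
# from a normal distribution with the observed sample's mean and SD.
# The 3 x 3 layout assumes n_sim = 8.
qq_compare <- function(x, n_sim = 8) {
  sims <- replicate(n_sim, rnorm(length(x), mean = mean(x), sd = sd(x)))
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))
  qqnorm(x, main = "Observed")
  qqline(x)
  for (i in seq_len(n_sim)) {
    qqnorm(sims[, i], main = paste("Simulation", i))
    qqline(sims[, i])
  }
  invisible(sims)  # one column per simulated sample
}

sims <- qq_compare(hgt)
```

If the observed panel is hard to tell apart from the simulated ones, the data are plausibly nearly normal.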

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

This distribution is not as normal as that of height: it is right skewed, with a longer tail, indicating that the data are more spread out.

hist(fdims$wgt, probability = TRUE)
x <- 10:120
y <- dnorm(x = x, mean = mean(fdims$wgt), sd = sd(fdims$wgt))
lines(x = x, y = y, col = "blue")

qqnorm(fdims$wgt)
qqline(fdims$wgt)

Based on the histogram and normal Q-Q plot, the distribution is right skewed with some outliers.
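Skewness can also be summarized with a single number. The sketch below uses one common definition of sample skewness, the mean of the cubed z-scores (other definitions differ slightly in their small-sample corrections). The skewness helper and the simulated samples are assumptions for illustration; neither comes from the lab.

```r
# Sample skewness as the mean of cubed z-scores (one of several definitions).
# Values near 0 suggest symmetry; positive values suggest right skew.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Simulated examples, not the bdims data.
set.seed(2)
symmetric    <- rnorm(1000)  # symmetric around 0
right_skewed <- rexp(1000)   # long right tail

skew_values <- c(symmetric    = skewness(symmetric),
                 right_skewed = skewness(right_skewed))
skew_values
```

Applied to the lab data, skewness(fdims$wgt) would be expected to come out positive if the weights are right skewed as the plots suggest.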

Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

What is the probability that a randomly chosen young adult female is shorter than 5 feet (about 152 cm)?

pnorm(q = 152, mean = fhgtmean, sd = fhgtsd)

## [1] 0.02459975

sum(fdims$hgt <= 152) / length(fdims$hgt)

## [1] 0.02692308

What is the probability that a randomly chosen young adult female weighs more than 50 kg?

fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
1-pnorm(q = 50, mean = fwgtmean, sd = fwgtsd)

## [1] 0.864857

sum(fdims$wgt > 50) / length(fdims$wgt)

## [1] 0.8807692

1-pnorm(q = 60, mean = fwgtmean, sd = fwgtsd)

## [1] 0.524893

sum(fdims$wgt > 60) / length(fdims$wgt)

## [1] 0.4384615

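Both examples I chose show close agreement between the two methods. For weight, however, the difference appears to be much larger when the cutoff is near or to the right of the mean (positive z-score): at a cutoff of 60, the theoretical probability is 0.525 but the empirical one is 0.438. Height therefore shows the closer agreement overall.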
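The theoretical-versus-empirical comparison repeated above can be wrapped in a small helper so that any cutoff is easy to check. This is a sketch only: compare_prob is a hypothetical name, and the data are simulated stand-ins (mean 165, SD 6.5 are assumed values), not the bdims measurements.

```r
# Simulated stand-in for a nearly normal variable; NOT the real bdims data.
set.seed(7)
x <- rnorm(n = 2000, mean = 165, sd = 6.5)

# Compare the normal-model probability with the observed proportion.
# lower = TRUE  gives P(X <= cutoff); lower = FALSE gives P(X > cutoff).
compare_prob <- function(x, cutoff, lower = TRUE) {
  theoretical <- pnorm(cutoff, mean = mean(x), sd = sd(x), lower.tail = lower)
  empirical   <- if (lower) mean(x <= cutoff) else mean(x > cutoff)
  c(theoretical = theoretical,
    empirical   = empirical,
    abs_diff    = abs(theoretical - empirical))
}

res <- compare_prob(x, cutoff = 158.5, lower = TRUE)
res
```

With the lab data, compare_prob(fdims$wgt, 60, lower = FALSE) would mirror the last pair of calculations above, and the abs_diff entry makes the height and weight comparisons directly comparable.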

On Your Own

Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.

b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.

c. The histogram for general age (age) belongs to normal probability plot letter D.

d. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.

Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

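This is most likely due to rounding in the measurements: values recorded to limited precision repeat often, so the points line up in steps.
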
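The effect of rounding is easy to reproduce with simulated data. In the sketch below, the sample size, mean, and SD are arbitrary assumptions chosen only to make the ties visible; nothing here comes from the bdims data.

```r
# Simulated values, not the bdims data.
set.seed(3)
raw     <- rnorm(n = 200, mean = 30, sd = 10)  # continuous, essentially no ties
rounded <- round(raw)                          # recorded to the nearest whole unit

# Rounding collapses many distinct values into a few repeated ones.
n_unique <- c(raw = length(unique(raw)), rounded = length(unique(rounded)))
n_unique

# Avoid writing a plot file when this is run as a non-interactive script.
if (!interactive()) pdf(NULL)

# The repeated values show up as horizontal steps in the Q-Q plot.
qqnorm(rounded, main = "Rounded values")
qqline(rounded)
```

Running qqnorm(raw) for comparison gives a smooth curve, which suggests the steps come from the rounding rather than from the shape of the distribution.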
As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

Right skewed.

qqnorm(fdims$kne.di)
qqline(fdims$kne.di)

hist(fdims$kne.di, probability = TRUE)
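# Build the grid from the data range; 140:190 was the grid used for heights.
x <- seq(min(fdims$kne.di), max(fdims$kne.di), length.out = 200)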
y <- dnorm(x = x, mean = mean(fdims$kne.di), sd = sd(fdims$kne.di))
lines(x = x, y = y, col = "blue")

[Figure: histQQmatch, the histograms and normal probability plots for the matching exercise]

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.