The normal distribution

The Data

rm(list=ls())

load("more/bdims.RData")

mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

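Both distributions are approximately normal: unimodal, roughly symmetric, and clustered around the mean. The mean height for males is 177.7 cm (about 5 ft 10 in); the mean height for females is 164.9 cm (about 5 ft 5 in), so the male distribution is centered roughly 13 cm higher.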

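library(ggplot2)
library(gridExtra)
# Histogram of male heights (1 cm bins)
p1 <- ggplot(mdims, aes(x = hgt)) +
  geom_histogram(binwidth = 1, color = "darkblue", fill = "lightblue") +
  theme_bw() +
  ylim(c(0, 30))
# Histogram of female heights on the same y-axis scale
p2 <- ggplot(fdims, aes(x = hgt)) +
  geom_histogram(binwidth = 1, color = "darkred", fill = "lightpink") +
  theme_bw() +
  ylim(c(0, 30))
# Show the two histograms side by side
grid.arrange(p1, p2, nrow = 1)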

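# Population (divide-by-n) standard deviations and means for each group
mdims_sd <- sd(mdims$hgt) * sqrt((length(mdims$hgt) - 1) / length(mdims$hgt))
mdims_mean <- mean(mdims$hgt)
fdims_sd <- sd(fdims$hgt) * sqrt((length(fdims$hgt) - 1) / length(fdims$hgt))
fdims_mean <- mean(fdims$hgt)
# Difference in means scaled by the combined spread of the two groups
Z_stat <- (mdims_mean - fdims_mean) / sqrt(mdims_sd^2 + fdims_sd^2)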

The Z-statistic is 1.33, which is less than 1.96; by this measure, the height distributions of females and males in this sample overlap substantially despite the difference in means.
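The calculation above can be sketched as two small functions. This is only an illustration: the helper names `pop_sd` and `std_diff` are hypothetical, and the heights are simulated with assumed means and standard deviations rather than read from `bdims`.

```r
# Hypothetical helpers; simulated heights stand in for the bdims data.
pop_sd <- function(x) {
  n <- length(x)
  sd(x) * sqrt((n - 1) / n)  # rescale the n - 1 estimator to divide by n
}

std_diff <- function(a, b) {
  # Difference in means divided by the combined spread of both groups
  (mean(a) - mean(b)) / sqrt(pop_sd(a)^2 + pop_sd(b)^2)
}

set.seed(1)
men   <- rnorm(250, mean = 177.7, sd = 7.2)  # assumed values, illustration only
women <- rnorm(250, mean = 164.9, sd = 6.5)  # assumed values, illustration only
std_diff(men, women)
```

Because the denominator combines the spread of both groups, the statistic describes how far apart the two centers are relative to the variation among individuals, not relative to the uncertainty in the means.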

The normal distribution

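fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
# Density histogram of female heights
hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06))
# Overlay the normal curve with the same mean and standard deviation
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")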

Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes, the data follow a nearly normal distribution.

Evaluating the normal distribution

qqnorm(fdims$hgt)
qqline(fdims$hgt)

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

Most points fall on the line, though not all do; the two plots are almost identical. The points are distributed in a step-like fashion.

qqnorm(sim_norm)
qqline(sim_norm)

Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function. It may be helpful to click the zoom button in the plot window.

qqnormsim(fdims$hgt)
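The definition of `qqnormsim` is not shown in this document; it is loaded along with the lab data. As a rough idea of what such a function might do, here is a minimal sketch under the assumption that it draws the observed data next to several samples simulated from a normal distribution with the same mean and standard deviation. The name `qqnorm_sim_sketch`, the `n_sim` argument, and the simulated heights are all hypothetical.

```r
# Hypothetical stand-in for qqnormsim(); the real definition is not shown
# in this document, so the layout and arguments here are assumptions.
qqnorm_sim_sketch <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))  # restore the plotting layout afterwards

  # First panel: the observed data
  qqnorm(dat, main = "Data")
  qqline(dat)

  # Remaining panels: normal samples with the same size, mean, and sd
  sims <- replicate(
    n_sim,
    rnorm(length(dat), mean = mean(dat), sd = sd(dat)),
    simplify = FALSE
  )
  for (i in seq_along(sims)) {
    qqnorm(sims[[i]], main = paste("sim", i))
    qqline(sims[[i]])
  }
  invisible(sims)
}

set.seed(42)
heights <- rnorm(260, mean = 165, sd = 6.5)  # simulated stand-in for fdims$hgt
sims <- qqnorm_sim_sketch(heights)
```

If the data panel is hard to tell apart from the simulated panels, the data are consistent with a normal model; a panel that stands out suggests a departure from normality.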

Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?

The normal probability plot looks very similar; this provides evidence that the female heights are nearly normal.

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

Female weights are not normally distributed; the tails diverge from the line and two data points stray from the normal distribution.

qqnormsim(fdims$wgt)

Normal probabilities

Write out two probability questions that you would like to answer: one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution and the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

What is the probability that a randomly chosen young adult female is shorter than 5.5 feet (about 168 cm)?

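# Theoretical probability from the fitted normal distribution, as a percentage
theory1 <- round(pnorm(q = 168, mean = fhgtmean, sd = fhgtsd) * 100, 1)
# Empirical probability: share of observed heights below 168 cm, as a percentage
empirical1 <- round((sum(fdims$hgt < 168) / length(fdims$hgt)) * 100, 1)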

The theoretical probability that a female randomly chosen from our sample would be shorter than 168 cm (or 5.5 ft) is 68.4%; the empirical probability is 68.5%.

What is the probability that a randomly chosen young adult female is lighter than 55 kg (about 121 lbs)?

fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
# Multiply by 100 so the theoretical value is a percentage, like the empirical one
theory2 <- round(pnorm(q = 55, mean = fwgtmean, sd = fwgtsd) * 100, 1)
empirical2 <- round((sum(fdims$wgt < 55) / length(fdims$wgt)) * 100, 1)

The theoretical probability that a female randomly chosen from our sample would be lighter than 55 kg (or 121 lbs) is roughly 30% (0.3 as a proportion, rounded to one decimal place); the empirical probability is 28.8%.
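The theoretical and empirical calculations can be wrapped in one helper so that both values are on the same scale before rounding. This is a minimal sketch: `compare_prob` is a hypothetical name, and the weights are simulated with assumed parameters rather than taken from `fdims`.

```r
# Hypothetical helper: P(X < cutoff) under a fitted normal model and as the
# observed share of values below the cutoff, both as proportions.
compare_prob <- function(x, cutoff) {
  c(
    theoretical = pnorm(cutoff, mean = mean(x), sd = sd(x)),
    empirical   = mean(x < cutoff)
  )
}

set.seed(7)
weights <- rnorm(260, mean = 60, sd = 10)  # simulated, assumed parameters
probs <- round(compare_prob(weights, 55) * 100, 1)  # convert to percent once
probs
```

Converting to a percentage in a single place, after both proportions are computed, avoids reporting one value as a proportion and the other as a percentage.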

On Your Own

Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.

b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.

c. The histogram for general age (age) belongs to normal probability plot letter D.

d. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.
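The standardization described in the exercise (subtract the mean, then divide by the standard deviation) changes the units but not the shape of a distribution, which is why the histograms and probability plots can still be matched. A minimal sketch, using simulated values with assumed parameters in place of a `bdims` variable:

```r
set.seed(3)
x <- rnorm(100, mean = 28, sd = 2.5)  # simulated, assumed parameters

# Standardize by hand: subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)
c(mean = mean(z), sd = sd(z))

# The built-in scale() does the same centering and scaling
z2 <- as.numeric(scale(x))
```

After standardization the mean is 0 and the standard deviation is 1 (up to floating-point error), whatever the original units were.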

Note that normal probability plots C and D have a slight stepwise pattern.
Why do you think this is the case?

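This is caused by chunking in the data: the values are recorded at a limited number of discrete levels, so many observations share exactly the same value and line up in horizontal steps. The plot would be smoother if the values were recorded on a finer, more continuous scale.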
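One way to see the effect of chunking is to round a continuous sample and compare the two normal probability plots. This is a sketch with simulated values and assumed parameters, not the `bdims` variables.

```r
set.seed(11)
smooth  <- rnorm(260, mean = 30, sd = 9)  # simulated continuous values
chunked <- round(smooth)                  # recorded to the nearest whole unit

# Rounding collapses many observations onto the same value
length(unique(smooth))
length(unique(chunked))

# Side-by-side normal probability plots
old_par <- par(mfrow = c(1, 2))
qqnorm(smooth, main = "Continuous")
qqline(smooth)
qqnorm(chunked, main = "Rounded")
qqline(chunked)
par(old_par)
```

The rounded sample has far fewer distinct values, and its tied observations appear as horizontal runs of points in the plot.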

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

The normal probability plot below indicates that the distribution is right skewed, and the histogram confirms this.

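# Normal probability plot for female knee diameter
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)
fkne.dimean <- mean(fdims$kne.di)
fkne.disd <- sd(fdims$kne.di)
# Histogram with fine bins to check the direction of the skew
hist(fdims$kne.di, breaks = 100)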
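A numeric check can complement the two plots. The sketch below uses a moment-based skewness coefficient (the mean of the cubed z-scores), which is not part of the original lab. The function name `sample_skewness` is hypothetical, and the samples are simulated rather than taken from `fdims`.

```r
# Hypothetical helper: mean of cubed z-scores. Positive values suggest a
# longer right tail, negative values a longer left tail.
sample_skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(5)
right_skewed <- rexp(260, rate = 1)  # simulated right-skewed sample
symmetric    <- rnorm(260)           # simulated symmetric sample

sample_skewness(right_skewed)
sample_skewness(symmetric)
```

A clearly positive coefficient agrees with a right-skewed histogram and with a normal probability plot whose upper tail bends above the line.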