The normal distribution

In this lab we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.

The Data

This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.

library(DATA606)

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.

load("more/bdims.RData")

Let’s take a quick peek at the first few rows of the data.

head(bdims)

##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1

You’ll see that for every observation we have 25 measurements, many of which are either diameters or girths. A key to the variable names can be found at http://www.openintro.org/stat/data/bdims.php, but we’ll be focusing on just three columns to get started: weight in kg (wgt), height in cm (hgt), and sex (1 indicates male, 0 indicates female).

Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.

mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)

1. Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

library(ggplot2)

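# Quick histogram of women's heights (cm) with 1 cm bins
qplot(fdims$hgt, geom="histogram", binwidth=1)

# Equivalent ggplot() calls, one per sex:
# ggplot(data=mdims, aes(x=hgt)) + geom_histogram(binwidth=1)
# ggplot(data=fdims, aes(x=hgt)) + geom_histogram(binwidth=1)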


DF <- rbind(data.frame(fill="blue", obs=mdims$hgt),
            data.frame(fill="green", obs=fdims$hgt))
          
ggplot(DF, aes(x=obs, fill=fill)) +
  geom_histogram(binwidth=1, colour="black", position="dodge") +
  scale_fill_identity()

To get a decent overall look at the data I binned it at 4 cm per bin (heights are recorded in cm, not inches); the plots above use 1 cm bins.
Men have a larger mean and median height, as well as larger minimum and maximum heights.
Both distributions appear roughly normal, although both have a slight right tail.
With 1 cm bins, women’s heights may look bimodal, with peaks in the 160-165 cm and 168-172 cm ranges.
- This is likely an artifact of our sample size.
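
The visual comparison above can be backed up with a few summary statistics. The following is a minimal sketch, not part of the original lab: the helper `summarise_heights` is a hypothetical name, and the two vectors are simulated stand-ins with made-up means and standard deviations so the block runs without `bdims.RData`. With the real data you would pass `mdims$hgt` and `fdims$hgt` instead.

```r
# Hypothetical helper: a few summary statistics for one numeric vector.
summarise_heights <- function(x) {
  c(min = min(x), median = median(x), mean = mean(x), max = max(x), sd = sd(x))
}

# Simulated stand-ins for mdims$hgt and fdims$hgt (assumed values, not the real data).
set.seed(1)
men_hgt   <- rnorm(247, mean = 178, sd = 7)
women_hgt <- rnorm(260, mean = 165, sd = 6.5)

# One row per group, one column per statistic.
height_summary <- rbind(men   = summarise_heights(men_hgt),
                        women = summarise_heights(women_hgt))
round(height_summary, 1)
```

Reading the two rows side by side makes the comparison of center (mean, median) and spread (sd, min, max) explicit rather than relying on the histogram alone.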

The normal distribution

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)

hist(fdims$hgt, probability = TRUE,ylim = c(0, 0.06))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
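
The curve drawn by `dnorm` above is just the normal density formula evaluated on a grid. As a hedged illustration, the sketch below writes that formula out by hand and checks it against `dnorm`. The values of `mu` and `sigma` are assumed for illustration and are not the ones computed from `bdims`; the sample passed to `hist` is simulated.

```r
# Assumed illustrative parameters (not the values computed from bdims).
mu    <- 165
sigma <- 6.5
x     <- 140:190

# Normal density written out by hand.
manual_density <- exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))

# Same values from the built-in density function.
builtin_density <- dnorm(x, mean = mu, sd = sigma)
all.equal(manual_density, builtin_density)

# A density-scaled histogram has bar areas that sum to 1, which is why
# the histogram and the curve can share one y-axis.
set.seed(1)
h <- hist(rnorm(260, mean = mu, sd = sigma), plot = FALSE)
total_area <- sum(h$density * diff(h$breaks))
total_area
```

This is why the lab calls `hist` with `probability = TRUE`: on a frequency scale the bars would be counts and the density curve would sit almost flat along the bottom of the plot.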

2. Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes. The histogram follows the overlaid normal curve reasonably closely, so the data appear nearly normal.

Evaluating the normal distribution

qqnorm(fdims$hgt)
qqline(fdims$hgt)
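
To make clear what `qqnorm` plots, the sketch below recovers its coordinates without drawing anything. It uses a simulated stand-in for `fdims$hgt` (the mean and standard deviation are assumptions), so the numbers are illustrative only.

```r
set.seed(2)
y <- rnorm(260, mean = 165, sd = 6.5)  # simulated stand-in for fdims$hgt

# qqnorm() can return the plotted coordinates instead of drawing them.
qq <- qqnorm(y, plot.it = FALSE)

# x-axis: standard normal quantiles at evenly spread probabilities.
theoretical <- qnorm(ppoints(length(y)))

# y-axis: the observed values; once sorted, they pair up with the sorted quantiles.
same_x <- all.equal(sort(qq$x), theoretical)
same_y <- all.equal(sort(qq$y), sort(y))

# For normal data the points lie close to a straight line,
# so the correlation between the two axes is near 1.
qq_cor <- cor(sort(qq$x), sort(qq$y))
c(same_x = same_x, same_y = same_y, qq_cor = qq_cor)
```

The reference line added by `qqline` passes through the first and third quartiles by default, so departures at the ends of the plot are what signal heavy tails or skew.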

3. Make a normal probability plot of `sim_norm`. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
qqnorm(sim_norm)
qqline(sim_norm)

Not all of the points fall exactly on the line, but most do. The plot for the simulated data looks very similar to the plot for the real data.

4. Does the normal probability plot for `fdims$hgt` look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?

Yes, they do.

5. Using the same technique, determine whether or not female weights appear to come from a normal distribution.

female_weight <- fdims$wgt
qplot(female_weight, geom="histogram")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

hist(female_weight, probability = TRUE,ylim = c(0, .05))
x <- 40:110
y <- dnorm(x = x, mean = mean(female_weight), sd = sd(female_weight))
lines(x = x, y = y, col = "blue")

qqnorm(female_weight)
qqline(female_weight, col="Red")

qqnormsim(female_weight)

Yes, weight seems approximately normally distributed.

Normal probabilities

Suppose we want to know the probability that a randomly chosen young adult female is taller than 182 cm. If we assume that female heights are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm.

1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)

## [1] 0.004434387

Note that the function pnorm gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
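
The link between the Z-score route and the one-step `pnorm` call can be checked directly. This is a small sketch with assumed parameters (`mu` and `sigma` are illustrative, not the values computed from `fdims`); it also shows the `lower.tail = FALSE` argument of `pnorm` as an alternative to subtracting from one.

```r
mu    <- 165   # assumed mean (cm), for illustration only
sigma <- 6.5   # assumed standard deviation (cm)
q     <- 182   # cutoff of interest (cm)

z <- (q - mu) / sigma   # Z score of the cutoff

p_direct <- 1 - pnorm(q, mean = mu, sd = sigma)                   # as in the lab
p_zscore <- 1 - pnorm(z)                                          # standard normal via Z
p_upper  <- pnorm(q, mean = mu, sd = sigma, lower.tail = FALSE)   # upper tail directly

c(z = z, p_direct = p_direct, p_zscore = p_zscore, p_upper = p_upper)
```

All three probabilities agree, because standardizing the cutoff and using the standard normal is exactly what `pnorm` does internally when given a mean and standard deviation.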

Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.

sum(fdims$hgt > 182) / length(fdims$hgt)

## [1] 0.003846154
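
The two approaches can be put side by side in one short block. The sketch below uses a simulated stand-in for `fdims$hgt` and an arbitrary cutoff, both of which are assumptions made so the example runs on its own.

```r
set.seed(3)
hgt    <- rnorm(260, mean = 165, sd = 6.5)  # simulated stand-in for fdims$hgt
cutoff <- 175                               # arbitrary illustrative cutoff (cm)

# Empirical: proportion of observations above the cutoff.
p_empirical <- sum(hgt > cutoff) / length(hgt)

# mean() of a logical vector gives the same proportion.
p_empirical_alt <- mean(hgt > cutoff)

# Theoretical: normal model using the sample mean and standard deviation.
p_theoretical <- pnorm(cutoff, mean = mean(hgt), sd = sd(hgt), lower.tail = FALSE)

c(empirical = p_empirical, theoretical = p_theoretical)
```

The closer the data are to normal, the closer these two numbers tend to be, although with a few hundred observations some sampling noise in the empirical proportion is expected.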

6. Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

What is the probability that a randomly chosen woman is shorter than 170 cm?
What is the probability that a randomly chosen woman weighs less than 70 kg?

pnorm(q = 170, mean = fhgtmean, sd = fhgtsd)

## [1] 0.7833331

sum(fdims$hgt < 170) / length(fdims$hgt)

## [1] 0.7538462

fwtmean <- mean(female_weight)
fwtsd   <- sd(female_weight)

pnorm(q = 70, mean = fwtmean, sd = fwtsd)

## [1] 0.8358461

sum(female_weight < 70) / length(female_weight)

## [1] 0.8423077

For these two questions, weight shows the closer agreement: the theoretical and empirical probabilities differ by about 0.006 for weight versus about 0.029 for height. A single cutoff can be misleading, though. The size of the gap changed as I tried different cutoff values, so a fair comparison would look at the difference between the empirical and theoretical distributions across the whole range of values rather than at one point.
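
One way to compare the two methods over the whole range, rather than at a single cutoff, is to take the largest gap between the empirical CDF and the fitted normal CDF over a grid of cutoffs. This is a sketch under stated assumptions: `max_cdf_gap` is a hypothetical helper, not a function from the lab or from DATA606, and both vectors are simulated stand-ins rather than `bdims` columns. The idea is similar in spirit to a Kolmogorov-Smirnov distance, but it is only evaluated on a grid and is not a formal test.

```r
# Hypothetical helper: largest gap between the empirical CDF and a fitted
# normal CDF, evaluated on a grid of cutoffs spanning the data.
max_cdf_gap <- function(x, n_grid = 200) {
  grid        <- seq(min(x), max(x), length.out = n_grid)
  empirical   <- ecdf(x)(grid)                          # proportion <= each cutoff
  theoretical <- pnorm(grid, mean = mean(x), sd = sd(x))
  max(abs(empirical - theoretical))
}

set.seed(4)
normal_like <- rnorm(260, mean = 165, sd = 6.5)  # stand-in for a nearly normal variable
skewed      <- rexp(260, rate = 1 / 10) + 45     # stand-in for a right-skewed variable

gaps <- c(normal_like = max_cdf_gap(normal_like), skewed = max_cdf_gap(skewed))
gaps
```

Applied to `fdims$hgt` and `fdims$wgt`, the variable with the smaller gap would be the one whose theoretical and empirical probabilities agree more closely overall.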
* * *

On Your Own

Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
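
Standardizing as described above takes one line in R. The sketch below uses a simulated stand-in for a `bdims` column (the mean and standard deviation are assumptions) and checks the result against the built-in `scale` function.

```r
set.seed(5)
x <- rnorm(260, mean = 60, sd = 9)  # simulated stand-in for any bdims column

z <- (x - mean(x)) / sd(x)          # subtract the mean, then divide by the sd

z_mean <- mean(z)                   # approximately 0
z_sd   <- sd(z)                     # 1, up to floating-point rounding

# scale() performs the same centering and scaling.
matches_scale <- all.equal(z, as.numeric(scale(x)))
c(z_mean = z_mean, z_sd = z_sd, matches_scale = matches_scale)
```

Standardizing changes only the axis labels, not the shape, so a histogram or normal probability plot of `z` has the same pattern as one of `x`.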

a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.

b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.

c. The histogram for general age (age) belongs to normal probability plot letter D.

d. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.

Note that normal probability plots C and D have a slight step-wise pattern. Why do you think this is the case?
- The measurements are likely recorded at discrete values, so many observations share the same value and line up in steps.

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
- The variable looks right skewed, so the histogram will likely show a right tail.
- There are outliers, but the data appear to fall within the 68-95-99.7 rule and are likely close to normally distributed.
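
The step-wise pattern noted above for plots C and D can be reproduced by rounding. This is a small illustration with simulated values (the mean, standard deviation, and rounding to whole units are all assumptions), not an analysis of the actual `bdims` variables.

```r
set.seed(6)
continuous <- rnorm(260, mean = 30, sd = 8)  # simulated, effectively no ties
rounded    <- round(continuous)              # same values recorded in whole units

# Rounding leaves far fewer distinct values, so many points share the same
# vertical position on a normal probability plot and form horizontal steps.
n_distinct <- c(continuous = length(unique(continuous)),
                rounded    = length(unique(rounded)))
n_distinct

# Largest number of observations sharing a single recorded value.
largest_tie <- max(table(rounded))
largest_tie
```

Calling `qqnorm(rounded)` next to `qqnorm(continuous)` shows the same overall line with the stair-step texture added by the ties.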

qqnorm(fdims$kne.di)
qqline(fdims$kne.di)

min(fdims$kne.di)

## [1] 15.7

hist(fdims$kne.di, probability = TRUE,ylim = c(0, 0.4))
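x <- 15:25  # integer grid (cm) on which the normal curve is evaluated
# min:max steps by 1 from the minimum, so the last value may stop short of the maximum
x_2 <- min(fdims$kne.di):max(fdims$kne.di)
x_2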

## [1] 15.7 16.7 17.7 18.7 19.7 20.7 21.7 22.7 23.7

y <- dnorm(x = x, mean = mean(fdims$kne.di), sd = sd(fdims$kne.di))
lines(x = x, y = y, col = "blue")

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.

Justin Herman
