Body Dimensions: This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.
load("more/bdims.RData")
Let’s take a quick peek at the first few rows of the data.
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 1
## 2 17.0 23 71.8 175.3 1
## 3 16.9 28 80.7 193.5 1
## 4 16.6 23 72.6 186.5 1
## 5 18.0 22 78.8 187.2 1
## 6 16.9 21 74.8 181.5 1
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
# men's heights
library(ggplot2)
qplot(mdims$hgt, binwidth = 5)
# women's heights
qplot(fdims$hgt, binwidth = 5)
Both appear relatively normal and bell-shaped.
To see how accurate that description is, we can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution.
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve.
hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
Yes, based on this plot the data appear to follow an approximately normal distribution.
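To make the overlay concrete, the following sketch checks that the curve heights returned by dnorm match the normal density formula evaluated by hand. The values of mu and sigma are illustrative placeholders chosen for this example, not statistics computed from bdims.

```r
# Illustrative parameters only; in the lab these would be fhgtmean and fhgtsd.
mu <- 165
sigma <- 6.5

x <- 140:190

# Normal density evaluated directly from its formula.
y_manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

# The same density from R's built-in function.
y_dnorm <- dnorm(x = x, mean = mu, sd = sigma)

# TRUE when the two agree up to floating-point tolerance.
densities_match <- isTRUE(all.equal(y_manual, y_dnorm))
print(densities_match)
```

Because the histogram is drawn with probability = TRUE, its bar heights are densities as well, which is why the two can share one vertical axis.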
An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.
qqnorm(fdims$hgt)
qqline(fdims$hgt)
A data set that is nearly normal will result in a probability plot where the points closely follow the line. Any deviations from normality lead to deviations of these points from the line.
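As a rough sketch of what a normal probability plot contains, the code below builds the plotted pairs by hand and compares them with what qqnorm returns. The vector hgt_demo is simulated stand-in data (an assumption for this example), not the real fdims$hgt.

```r
set.seed(1)
# Simulated stand-in for fdims$hgt (hypothetical values, not the real data).
hgt_demo <- rnorm(n = 260, mean = 165, sd = 6.5)

n <- length(hgt_demo)

# Theoretical quantiles: standard normal quantiles at evenly spread probabilities.
theoretical <- qnorm(ppoints(n))

# Sample quantiles: the observations in increasing order.
sample_q <- sort(hgt_demo)

# qqnorm returns the same pairs (in the original data order) when plot.it = FALSE.
qq <- qqnorm(hgt_demo, plot.it = FALSE)
same_x <- isTRUE(all.equal(sort(qq$x), theoretical))
same_y <- isTRUE(all.equal(sort(qq$y), sample_q))
print(c(same_x, same_y))
```

Plotting sample_q against theoretical gives the same picture as qqnorm(hgt_demo): a roughly straight line when the data are close to normal.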
What do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
qqnorm(sim_norm)
qqline(sim_norm)
Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function.
qqnormsim(fdims$hgt)
Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?
While there is somewhat more variation at the ends of the female height data, in general female heights appear to be nearly normal.
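The lab does not show the body of qqnormsim, so the following is only a sketch of what a helper of this kind might do: draw the Q-Q plot for the data, then eight more for samples simulated from a normal distribution with the same mean and standard deviation. The name qq_sim_sketch, its n_sim argument, and the simulated hgt_demo vector are all hypothetical.

```r
# Hypothetical helper; NOT the qqnormsim function that ships with the lab.
qq_sim_sketch <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3))  # 3 x 3 grid: 1 real plot + 8 simulated
  on.exit(par(old_par))            # restore the plotting layout afterwards

  qqnorm(dat, main = "Normal Q-Q Plot (Data)")
  qqline(dat)

  sims <- vector("list", n_sim)
  for (i in seq_len(n_sim)) {
    # Simulated sample with the same size, mean, and sd as the data.
    sims[[i]] <- rnorm(n = length(dat), mean = mean(dat), sd = sd(dat))
    qqnorm(sims[[i]], main = "Normal Q-Q Plot (Sim)")
    qqline(sims[[i]])
  }
  invisible(sims)
}

set.seed(2)
hgt_demo <- rnorm(n = 260, mean = 165, sd = 6.5)  # simulated stand-in data
sims <- qq_sim_sketch(hgt_demo)
```

If the plot for the data is hard to pick out from the simulated panels, that is informal evidence that the data are close to normal.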
qqnorm(fdims$wgt)
qqline(fdims$wgt)
The distribution of weight, however, does not appear to be as close to normal as the distribution of height.
Once we decide that a random variable is approximately normal, we can answer all sorts of questions about that variable related to probability.
“What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?” If we assume that female heights are normally distributed, we can find this probability by calculating a Z score and consulting a Z table. In R, this is done in one step with the function pnorm.
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
Note that the function pnorm gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
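The link between the Z-table approach and pnorm can be sketched as follows. The values of mu and sigma are illustrative placeholders, not the sample statistics from fdims, so the printed probabilities will differ from the lab output.

```r
# Illustrative parameters only; in the lab these would be fhgtmean and fhgtsd.
mu <- 165
sigma <- 6.5
q <- 182

# Z score: how many standard deviations q lies above the mean.
z <- (q - mu) / sigma

# Upper-tail probability computed three equivalent ways.
p_from_z     <- 1 - pnorm(z)                                  # Z-table style
p_direct     <- 1 - pnorm(q = q, mean = mu, sd = sigma)       # as in the lab
p_upper_tail <- pnorm(q = q, mean = mu, sd = sigma, lower.tail = FALSE)

print(c(p_from_z, p_direct, p_upper_tail))
```

The lower.tail = FALSE form avoids the explicit subtraction and is slightly more accurate numerically for very small tail probabilities.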
Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
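A small sketch of the empirical calculation, using simulated stand-in data (hgt_demo is hypothetical, not fdims$hgt): the comparison hgt_demo > 182 yields a logical vector, and since TRUE counts as 1, its mean is the same proportion as the sum divided by the length.

```r
set.seed(3)
# Simulated stand-in for fdims$hgt (hypothetical values, not the real data).
hgt_demo <- rnorm(n = 260, mean = 165, sd = 6.5)

# Count of observations above 182, divided by the sample size.
p_sum <- sum(hgt_demo > 182) / length(hgt_demo)

# Equivalent shortcut: the mean of a logical vector is the proportion of TRUEs.
p_mean <- mean(hgt_demo > 182)

# Theoretical counterpart using the sample mean and sd of the same data.
p_theory <- 1 - pnorm(q = 182, mean = mean(hgt_demo), sd = sd(hgt_demo))

print(c(empirical = p_sum, shortcut = p_mean, theoretical = p_theory))
```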
What is the probability that a randomly chosen young adult female is shorter than 5’3’’ (about 160 cm)?
pnorm(q = 160, mean = fhgtmean, sd = fhgtsd)
## [1] 0.2282939
sum(fdims$hgt < 160) / length(fdims$hgt)
## [1] 0.1923077
What is the probability that a randomly chosen young adult female weighs less than 54 kg?
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
pnorm(q = 54, mean = fwgtmean, sd = fwgtsd)
## [1] 0.2462249
sum(fdims$wgt < 54) / length(fdims$wgt)
## [1] 0.2192308
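Overall, height shows the closer agreement between the theoretical and empirical probabilities: for the 182 cm question the two differ by less than 0.001. For the 160 cm question, however, the gap (about 0.036) is slightly larger than the gap for weight (about 0.027).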
Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
**a.** The histogram for female biiliac (pelvic) diameter (`bii.di`) belongs to normal probability plot letter ____.
Plot B
**b.** The histogram for female elbow diameter (`elb.di`) belongs to normal probability plot letter ____.
Plot C
**c.** The histogram for general age (`age`) belongs to normal probability plot letter ____.
Plot D
**d.** The histogram for female chest depth (`che.de`) belongs to normal probability plot letter ____.
Plot A
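The standardization described in the exercise above (subtract the mean, then divide by the standard deviation) can be sketched in a few lines. The vector x is simulated placeholder data, not one of the bdims variables.

```r
set.seed(4)
# Hypothetical measurements; any numeric variable could stand in here.
x <- rnorm(n = 260, mean = 60, sd = 9)

# Standardize by hand: subtract the mean, then divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

# scale() performs the same centering and scaling (it returns a one-column matrix).
z_scale <- as.numeric(scale(x))

print(c(mean = mean(z), sd = sd(z)))
```

Standardizing changes only the location and scale of a variable, so the shape of its histogram and of its normal probability plot is unchanged. That is why the plots can still be matched even though the units are gone.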
The slight stepwise pattern in normal probability plots C and D is likely caused by discrete values in the data. Age is recorded in whole years, which makes the jumps more obvious. For elbow diameter, many people may cluster around the same recorded values, since the joint functions in the same way for most people.
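The effect of discrete values on a normal probability plot can be illustrated with simulated data. In this sketch, age_exact is a hypothetical continuous variable and age_whole is the same variable rounded to whole numbers; neither comes from bdims.

```r
set.seed(5)
# Hypothetical continuous values, then the same values rounded to whole numbers.
age_exact <- rnorm(n = 500, mean = 30, sd = 9)
age_whole <- round(age_exact)

# Rounding collapses many observations onto the same value.
n_distinct_exact <- length(unique(age_exact))
n_distinct_whole <- length(unique(age_whole))
print(c(exact = n_distinct_exact, rounded = n_distinct_whole))

# Tied values share one height on the vertical axis, producing visible steps.
qqnorm(age_whole)
qqline(age_whole)
```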
Make a normal probability plot for female knee diameter (`kne.di`). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
From the probability plot, there appear to be fewer values as the quantiles increase. Therefore, the variable appears to be right skewed.
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)
hist(fdims$kne.di)
[Figure: histQQmatch]
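For the knee diameter exercise, a numerical check can complement the visual one. The sketch below uses a simulated right-skewed sample (skew_demo, drawn from an exponential distribution as an assumption for illustration, not the real kne.di values) to show two common signs of right skew.

```r
set.seed(6)
# Simulated right-skewed sample (exponential); a hypothetical stand-in, not kne.di.
skew_demo <- rexp(n = 260, rate = 1)

# In a right-skewed sample the mean is typically pulled above the median.
mean_above_median <- mean(skew_demo) > median(skew_demo)

# The upper tail also stretches farther from the median than the lower tail.
q <- quantile(skew_demo, probs = c(0.05, 0.5, 0.95))
upper_spread <- unname(q[3] - q[2])
lower_spread <- unname(q[2] - q[1])

print(c(mean_above_median = mean_above_median,
        longer_upper_tail = upper_spread > lower_spread))

# On the Q-Q plot, the largest points curve up and away from the line.
qqnorm(skew_demo)
qqline(skew_demo)
```

The same comparisons could be run on fdims$kne.di once the lab data are loaded.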
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.