This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
head(bdims)
##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi che.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2   89.5
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5   97.0
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1   97.5
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5   97.0
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5   97.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8   99.9
##   wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi wri.gi age
## 1   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5   16.5  21
## 2   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5   17.0  23
## 3   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9   16.9  28
## 4   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0   16.6  23
## 5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4   18.0  22
## 6   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5   16.9  21
##    wgt   hgt sex
## 1 65.6 174.0   1
## 2 71.8 175.3   1
## 3 80.7 193.5   1
## 4 72.6 186.5   1
## 5 78.8 187.2   1
## 6 74.8 181.5   1
# Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
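As a quick sanity check on the split, the sketch below uses a small made-up data frame (the name `toy` and its values are hypothetical, not taken from `bdims`) to confirm that two complementary `subset()` calls account for every row.

```r
# Hypothetical toy data standing in for bdims (values are invented).
toy <- data.frame(
  hgt = c(174.0, 160.2, 181.5, 165.1, 158.8),
  sex = c(1, 0, 1, 0, 0)
)

# Same pattern as above: one subset per value of the sex indicator.
toy_m <- subset(toy, sex == 1)
toy_f <- subset(toy, sex == 0)

# The two pieces should account for every row of the original.
split_ok <- nrow(toy_m) + nrow(toy_f) == nrow(toy)
print(split_ok)
```

The same check on the real data would compare `nrow(mdims) + nrow(fdims)` with `nrow(bdims)`.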
hist(fdims$hgt, breaks = 10)
hist(mdims$hgt, breaks = 10)
# The histograms of women's and men's heights are both symmetric, bell-shaped, and unimodal, so both sets of heights would be well approximated by normal distributions.
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
hist(fdims$hgt, probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
Yes, it does. The histogram's unimodality, bell shape, and symmetry indicate that the data follow a nearly normal distribution.
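The overlay only lines up because `probability = TRUE` puts the histogram on the density scale, so that the bar areas sum to 1, matching the area under `dnorm`. A minimal sketch with simulated heights (the mean of 165 and standard deviation of 6.5 are illustrative assumptions, not values computed from `fdims`) checks both areas and uses a finer grid from `seq()` for a smoother curve:

```r
set.seed(1)
# Simulated stand-in for fdims$hgt; parameters are assumptions.
sim_hgt <- rnorm(260, mean = 165, sd = 6.5)

# plot = FALSE returns the bin information without drawing.
h <- hist(sim_hgt, plot = FALSE)
bar_area <- sum(h$density * diff(h$breaks))   # total area of the bars

# A finer grid than 140:190 gives a smoother curve.
grid <- seq(140, 190, by = 0.1)
dens <- dnorm(grid, mean = mean(sim_hgt), sd = sd(sim_hgt))
curve_area <- sum(dens) * 0.1                 # Riemann-sum approximation

hist(sim_hgt, probability = TRUE)
lines(grid, dens, col = "blue")
```

Without `probability = TRUE` the bars would show counts, and the density curve would sit almost flat along the bottom of the plot.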
qqnorm(fdims$hgt)
qqline(fdims$hgt)
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
qqnorm(sim_norm)
qqline(sim_norm)
# No, not all of the points fall on the line. This plot appears less normal than the actual plot of female heights. The observations between the second and third standard deviations above and below the mean deviate from the reference line drawn by qqline (which passes through the first and third quartiles rather than being a least-squares fit), producing a slight S-shape and suggesting some outliers at either end of the distribution. One visible difference between the two plots is that the actual heights are recorded to the nearest tenth, so certain values repeat and the actual plot looks granular, whereas the simulated values are continuous. In this particular simulation, the continuous simulated values happen to deviate more from the line.
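It may help to recall how the reference line is drawn: by default `qqline()` passes through the first and third quartiles of the data and of the theoretical normal distribution, rather than being fitted by least squares. A minimal sketch on simulated data (the sample and its parameters are assumptions for illustration) reproduces that line by hand:

```r
set.seed(2)
sim <- rnorm(260, mean = 165, sd = 6.5)      # illustrative sample

probs <- c(0.25, 0.75)
y_q <- quantile(sim, probs, names = FALSE)   # sample quartiles
x_q <- qnorm(probs)                          # theoretical quartiles

slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

qqnorm(sim)
qqline(sim)                                     # built-in reference line
abline(intercept, slope, col = "red", lty = 2)  # hand-computed line, should overlap
```

Because the line is anchored at the quartiles, points in the tails can drift away from it even when the middle of the distribution fits well.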
qqnormsim(fdims$hgt)
The normal probability plot for fdims$hgt looks very similar to the plots created for the simulated data. Therefore, the plots do provide evidence that the female heights are nearly normal. The first simulated plot in Exercise 3 appears to be an outlier relative to the new simulations in Exercise 4, most of which show far fewer outlying points. That the original simulation deviated from the actual normal probability plot is not surprising, since it was a single simulation of random observations given only a mean and standard deviation. If one were to perform thousands of simulations, the initial simulated plot would probably fall within two to three standard deviations of the mean simulated plot, assuming the plots are compared by correlation coefficient with the original normal probability plot’s coefficient taken as the mean.
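For readers without the lab's `qqnormsim` helper, the idea can be approximated as follows. This is a hypothetical re-implementation (the name `qq_sim_grid` and the 3 x 3 layout are assumptions, not the lab's actual code): plot the data once, then plot several normal samples simulated with the same size, mean, and standard deviation.

```r
# Hypothetical helper, not the lab's qqnormsim().
qq_sim_grid <- function(x, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))          # restore the plotting layout afterwards
  qqnorm(x, main = "Data")
  qqline(x)
  for (i in seq_len(n_sim)) {
    sim <- rnorm(length(x), mean = mean(x), sd = sd(x))
    qqnorm(sim, main = paste("Sim", i))
    qqline(sim)
  }
  invisible(n_sim + 1)           # number of panels drawn
}

set.seed(3)
demo <- rnorm(260, mean = 165, sd = 6.5)  # illustrative data, not fdims$hgt
panels <- qq_sim_grid(demo)
```

Seeing several simulations side by side gives a sense of how much a truly normal sample of this size can wander from the line by chance.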
qqnorm(fdims$wgt)
qqline(fdims$wgt)
qqnormsim(fdims$wgt)
# The normal probability plot for female weights appears very non-linear given the upward bend throughout the entire distribution, which would imply that the data would not be well-approximated or modeled by a normal distribution. The simulated distributions seem to validate this claim of non-linearity although some of the distributions do not exhibit as much non-linearity as others. In almost all of the simulated cases, however, there appear to be extreme outliers toward the top portion of the plot, indicating severe right-skew or upper-tail outliers.
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
# Theoretical probability
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
# Empirical probability
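The same comparison can be wrapped in a small helper. The function name `upper_tail` is hypothetical and the sample is simulated, so the printed numbers will not match the output above. Here `mean(x > q)` is equivalent to `sum(x > q) / length(x)`, and `lower.tail = FALSE` replaces the `1 - pnorm(...)` form.

```r
# Hypothetical helper: theoretical vs. empirical P(X > q).
upper_tail <- function(x, q) {
  c(
    theoretical = pnorm(q, mean = mean(x), sd = sd(x), lower.tail = FALSE),
    empirical   = mean(x > q)    # proportion of observations above q
  )
}

set.seed(4)
sim_hgt <- rnorm(260, mean = 165, sd = 6.5)  # illustrative stand-in for fdims$hgt
res <- upper_tail(sim_hgt, 182)
print(res)
```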
Question 1: What is the probability that a randomly selected female’s height is between 160 and 170 cm?
pnorm(q = 170, mean = fhgtmean, sd = fhgtsd) - pnorm(q = 160, mean = fhgtmean, sd = fhgtsd)
## [1] 0.5550392
# The probability is 55.50%
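An interval probability like this one can also be computed after standardizing to z-scores, and checked against an empirical proportion. In the sketch below the mean and standard deviation are assumed round numbers (165 and 6.5), not the values stored in `fhgtmean` and `fhgtsd`, and the sample is simulated, so the results differ from the output above.

```r
m <- 165   # assumed mean, for illustration only
s <- 6.5   # assumed standard deviation, for illustration only

# Direct form, as in the calculation above.
p_direct <- pnorm(170, mean = m, sd = s) - pnorm(160, mean = m, sd = s)

# Equivalent form using z-scores and the standard normal.
z_lo <- (160 - m) / s
z_hi <- (170 - m) / s
p_z <- pnorm(z_hi) - pnorm(z_lo)

# Empirical counterpart on a simulated sample.
set.seed(5)
sim_hgt <- rnorm(260, mean = m, sd = s)
p_emp <- mean(sim_hgt > 160 & sim_hgt < 170)

print(c(direct = p_direct, z_scores = p_z, empirical = p_emp))
```

On the real data, the empirical line would use `fdims$hgt` in place of `sim_hgt`.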
Question 2: What is the probability that a randomly selected female’s weight is greater than 80 kg?
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
1 - pnorm(q = 80, mean = fwgtmean, sd = fwgtsd)
## [1] 0.02182199
# The probability is 2.18%
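Because the normal probability plot for female weights bends upward, the normal-model figure for this question is worth comparing with the observed proportion. The sketch below uses a simulated right-skewed sample (log-normal, with made-up parameters) rather than `fdims$wgt`, simply to show how the two can be computed side by side and their gap inspected.

```r
set.seed(6)
# Simulated right-skewed weights; parameters are invented for illustration.
sim_wgt <- rlnorm(260, meanlog = log(60), sdlog = 0.15)

q <- 80
p_normal <- pnorm(q, mean = mean(sim_wgt), sd = sd(sim_wgt), lower.tail = FALSE)
p_empirical <- mean(sim_wgt > q)

print(c(normal_model = p_normal, empirical = p_empirical,
        difference = p_empirical - p_normal))
```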
The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.
The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.
The histogram for general age (age) belongs to normal probability plot letter D.
The histogram for female chest depth (che.de) belongs to normal probability plot letter A.
This pseudo-step function or stepwise pattern in plots C and D is likely due to repeated values, which cause granularity. For plots involving variables recorded as discrete integers, as with age in plot D, this is fairly difficult to avoid, but it does not really affect the data in any way and so can be ignored.
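The link between repeated values and the stepwise pattern can be seen by rounding a simulated continuous sample. The values here are simulated with assumed parameters and are not drawn from any `bdims` column:

```r
set.seed(7)
cont <- rnorm(260, mean = 30, sd = 9)   # continuous values, illustrative
whole <- round(cont)                    # same values rounded to integers

n_unique_cont <- length(unique(cont))
n_unique_whole <- length(unique(whole))
print(c(continuous = n_unique_cont, rounded = n_unique_whole))

old_par <- par(mfrow = c(1, 2))
qqnorm(cont, main = "Continuous")
qqline(cont)
qqnorm(whole, main = "Rounded to integers")  # ties show up as horizontal steps
qqline(whole)
par(old_par)
```

The rounded panel keeps the same overall trend as the continuous one; only the fine texture changes.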
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)
hist(fdims$kne.di)
# Based on the normal probability plot alone, one can infer that the variable is right-skewed since the points bend upward toward the top. Moreover, one can also infer that it contains a fair number of extreme upper-tail outliers, given the discontinuity among observations with extreme values from the second to the third standard deviation above the mean. The histogram confirms these inferences: the number of observations drops off significantly from 20 to 24, and the bulk of the observations sits closer to the lower tail, around 18.
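A numerical check can back up the visual reading of skew. The sketch below computes a simple moment-based sample skewness by hand (the helper name `sample_skew` is hypothetical, and the data are simulated rather than `fdims$kne.di`). A positive value, together with a mean above the median, is consistent with right skew.

```r
# Hypothetical helper: moment-based skewness, i.e. the mean cubed deviation
# divided by the cubed standard deviation (both using division by n).
sample_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(8)
# Simulated right-skewed measurements; parameters are invented.
sim_kne <- rgamma(260, shape = 4, rate = 4 / 18)

print(c(mean = mean(sim_kne), median = median(sim_kne),
        skewness = sample_skew(sim_kne)))
```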