This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
Let’s take a quick peek at the first few rows of the data.
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 1
## 2 17.0 23 71.8 175.3 1
## 3 16.9 28 80.7 193.5 1
## 4 16.6 23 72.6 186.5 1
## 5 18.0 22 78.8 187.2 1
## 6 16.9 21 74.8 181.5 1
Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
par(mfrow=c(1,2))
hist(mdims$hgt, xlab = "Men´s Heights",
main="Men´s Heights", xlim = c(150,200), ylim = c(0,80))
abline(v=mean(mdims$hgt), col="blue", lwd=2)
abline(v=median(mdims$hgt), col="red", lwd=2)
hist(fdims$hgt, xlab = "Women´s Heights",
main="Women´s Heights", xlim = c(140,190))
abline(v=mean(fdims$hgt), col="blue", lwd=2)
abline(v=median(fdims$hgt), col="red", lwd=2)
Both distribution are normal and the mean from the men´s height is higher than the womens´.
We’ll be working with women’s heights, so let’s store them as a separate object and then calculate some statistics that will be referenced later.
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve.
hist(fdims$hgt, probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
After plotting the density histogram with the first command, we create the x- and y-coordinates for the normal curve. We chose the x range as 140 to 190 in order to span the entire range of fheight. To create y, we use dnorm to calculate the density of each of those x-values in a distribution that is normal with mean fhgtmean and standard deviation fhgtsd. The final command draws a curve on the existing plot (the density histogram) by connecting each of the points specified by x and y. The argument col simply sets the color for the line to be drawn. If we left it out, the line would be drawn in black.
The top of the curve is cut off because the limits of the x- and y-axes are set to best fit the histogram. To adjust the y-axis you can add a third argument to the histogram function: ylim = c(0, 0.06).
hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06), xlim =c(140,190),
xlab = "Women´s Heights", main="Women´s Heights")
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
abline(v=mean(fdims$hgt), col="red", lwd=2)
Based on the this plot, does it appear that the data follow a nearly normal distribution?
Based on the plot, the data does appears to follow a nearly normal distribution. ## Evaluating The Normal Distribution
An approach to determine if the data appear to be nearly normally distributed involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.
qqnorm(fdims$hgt)
qqline(fdims$hgt)
What do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
The first argument indicates how many numbers you’d like to generate, which we specify to be the same number of heights in the fdims data set using the length function. The last two arguments determine the mean and standard deviation of the normal distribution from which the simulated sample will be generated. We can take a look at the shape of our simulated data set, sim_norm, as well as its normal probability plot.
sim_norm
## [1] 163.2994 172.4156 167.4664 158.8630 168.4197 159.1773 170.6428
## [8] 153.2758 177.6498 165.8720 176.3588 167.0026 167.1100 168.7718
## [15] 168.7752 166.2957 164.8864 162.4203 161.8295 174.5679 169.1050
## [22] 167.6007 155.4668 165.8326 165.6616 162.1160 166.5379 164.0544
## [29] 162.0224 164.7149 168.6610 156.6436 164.5210 167.8831 147.0004
## [36] 157.9717 161.6474 160.0960 161.8314 162.1508 163.1065 167.4621
## [43] 178.9841 171.0166 161.9763 159.6490 156.5641 177.8935 160.4571
## [50] 163.6145 155.6865 176.8930 164.3012 162.0356 164.3325 170.1055
## [57] 159.7236 171.6853 173.2170 160.2228 161.3531 162.3035 167.5636
## [64] 157.9161 152.8623 164.7203 158.2768 155.7922 162.6256 168.4430
## [71] 164.3607 180.3879 166.2930 171.4699 174.7254 172.9618 160.8397
## [78] 178.6403 173.9983 168.8166 170.5576 167.0476 167.1054 151.1412
## [85] 171.9181 153.4315 173.3901 174.3117 173.2795 168.1504 169.0251
## [92] 158.4575 164.0159 157.1124 170.8206 160.1274 166.7385 159.3059
## [99] 158.9636 168.6491 159.7211 166.2135 165.4458 165.0072 161.2974
## [106] 162.6746 148.4970 153.5782 176.0095 167.9539 177.6260 171.6921
## [113] 175.8509 157.3343 159.1186 166.0022 162.7113 151.6836 166.2604
## [120] 168.5301 162.7542 165.4549 155.7296 168.4171 160.6706 157.0236
## [127] 180.0895 167.3166 161.6238 167.7224 157.9849 171.3849 161.8887
## [134] 163.9841 168.6196 170.0579 168.2692 158.7089 151.6223 158.5541
## [141] 172.2222 161.3734 165.2281 170.3769 169.5444 170.3721 165.7628
## [148] 168.8076 164.0647 150.4725 162.0645 150.7605 164.8024 152.1157
## [155] 159.3679 163.9128 157.0980 161.3424 158.1355 162.9273 159.6803
## [162] 168.9830 172.5490 157.8424 174.3137 153.0364 164.3167 168.8683
## [169] 170.4894 169.7690 160.5136 172.2174 167.8398 161.5778 173.7627
## [176] 162.3653 169.9249 174.6101 166.3108 162.8256 169.7923 172.0248
## [183] 170.5305 161.6726 162.3599 170.4509 156.5936 159.0874 166.5359
## [190] 155.5469 156.9685 164.8081 173.8375 154.0609 164.9381 165.2637
## [197] 166.0180 166.5278 164.9506 150.8852 156.8658 167.1952 164.0639
## [204] 168.2129 159.5380 165.8163 170.0019 160.9639 163.7668 174.7059
## [211] 158.8277 171.6679 163.4156 168.6977 176.4604 177.6577 156.7624
## [218] 163.0994 160.6671 170.7773 170.6626 166.8654 154.2057 179.4545
## [225] 164.7424 155.7031 168.6157 159.4488 165.2219 161.3399 178.4876
## [232] 180.1699 169.2133 164.7567 155.4968 166.0708 165.1036 174.6695
## [239] 153.2979 157.8484 160.1493 170.3957 171.9665 158.4212 158.7379
## [246] 154.6549 165.1497 152.4653 149.9830 159.4510 166.8298 161.8194
## [253] 168.6502 167.3919 156.9616 162.3607 160.6594 166.1248 168.3507
## [260] 162.4817
qqnorm(sim_norm)
qqline(sim_norm)
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
qqnorm(sim_norm)
qqline(sim_norm)
Not all the points fall on the generated line. The general trend of the simulated data is very close to the real data.
Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function. It may be helpful to click the zoom button in the plot window.
qqnormsim(fdims$hgt)
Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?
The plot for fdims$hgt does look similar to the plots created for the simulated data. Thus, there is evidence that the female heights are nearly normal.
Using the same technique, determine whether or not female weights appear to come from a normal distribution.
qqnormsim(fdims$wgt)
Female weights appears to come from a general normal distribution with some values dispersing from the extremes.
If we assume that female heights are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm.
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
Note that the function pnorm gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.
Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate the those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
1 - pnorm(q = 170, mean = fhgtmean, sd = fhgtsd)
## [1] 0.2166669
sum(fdims$hgt > 170) / length(fdims$hgt)
## [1] 0.2230769
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
pnorm(q = 65.0, mean = fwgtmean, sd = fwgtsd)
## [1] 0.6763603
sum(fdims$wgt < 65.0) / length(fdims$wgt)
## [1] 0.7384615
The variable height had a closer agreement between the two methods than the variable weight.
The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.
The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.
The histogram for general age (age) belongs to normal probability plot letter D.
The histogram for female chest depth (che.de) belongs to normal probability plot letter A.
The stepwise patter is due to the repetition of same values due to rounding of measurements.
fknemean <- mean(fdims$kne.di)
fknesd <- sd(fdims$kne.di)
qqnormsim(fdims$kne.di)
hist(fdims$kne.di, probability = TRUE, breaks = 10, ylim = c(0, 0.4), xlim =c(15,25),
xlab = "Women´s Knee Diameter", main="Women´s Knee Diameter")
x <- 15:25
y <- dnorm(x = x, mean = fknemean, sd = fknesd)
lines(x = x, y = y, col = "blue")
abline(v=mean(fdims$kne.di), col="red", lwd=2)
shapiro.test(fdims$kne.di)
##
## Shapiro-Wilk normality test
##
## data: fdims$kne.di
## W = 0.94214, p-value = 1.345e-08
The variable is right skewed.