This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.
load("more/bdims.RData")
Let’s take a quick peek at the first few rows of the data.
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 1
## 2 17.0 23 71.8 175.3 1
## 3 16.9 28 80.7 193.5 1
## 4 16.6 23 72.6 186.5 1
## 5 18.0 22 78.8 187.2 1
## 6 16.9 21 74.8 181.5 1
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
Answer:
hist(mdims$hgt, main = "Male Height Distribution", breaks = 10, xlab = "Male Height (cm)", col = "blue")
summary(mdims$hgt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 157.2 172.9 177.8 177.7 182.7 198.1
hist(fdims$hgt, main = "Female Height Distribution", breaks = 10, xlab = "Female Height (cm)", col = "pink")
summary(fdims$hgt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 147.2 160.0 164.5 164.9 169.5 182.9
The shapes of the two distributions are quite similar: both male and female heights are symmetric and unimodal. The main difference is that the men's distribution is centered higher.
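To put numbers on that comparison, the sketch below works only from the values printed by the two summary() calls above. The helper name `describe` is hypothetical and is not part of the lab; it simply derives the interquartile range, the range, and the gap between mean and median (a rough symmetry check) for each group.

```r
# Values copied from the summary() output above (heights in cm)
male   <- c(min = 157.2, q1 = 172.9, median = 177.8, mean = 177.7, q3 = 182.7, max = 198.1)
female <- c(min = 147.2, q1 = 160.0, median = 164.5, mean = 164.9, q3 = 169.5, max = 182.9)

# Hypothetical helper: spread and symmetry indicators from one summary vector
describe <- function(s) {
  c(iqr = s[["q3"]] - s[["q1"]],
    range = s[["max"]] - s[["min"]],
    mean_minus_median = s[["mean"]] - s[["median"]])
}

# One row per group, one column per indicator
comparison <- rbind(male = describe(male), female = describe(female))
print(round(comparison, 1))

# Difference in typical height between the two groups
median_gap <- male[["median"]] - female[["median"]]
print(round(median_gap, 1))
```

The interquartile ranges come out within about 0.3 cm of each other and each mean sits within 0.5 cm of its median, while the medians differ by roughly 13 cm.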
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
hist(fdims$hgt, probability = TRUE)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
The top of the curve is cut off because the limits of the x- and y-axes are set to best fit the histogram. To adjust the y-axis you can add a third argument to the histogram function: ylim = c(0, 0.06).
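Instead of hard-coding the upper limit, it can be derived from the data. The sketch below is one possible approach, not the lab's method: it takes the taller of the highest histogram bar and the peak of the fitted normal curve, then adds a little headroom. The `heights` vector is simulated as a stand-in for fdims$hgt, and its mean and standard deviation are assumptions chosen only for illustration.

```r
set.seed(1)
# Simulated stand-in for fdims$hgt (assumed mean 165 cm, sd 6.5 cm)
heights <- rnorm(260, mean = 165, sd = 6.5)

m <- mean(heights)
s <- sd(heights)

# Tallest histogram bar on the density scale, computed without drawing
h <- hist(heights, plot = FALSE)
bar_peak <- max(h$density)

# A normal density is highest at its mean
curve_peak <- dnorm(m, mean = m, sd = s)

# Leave 5% headroom above whichever is taller
y_max <- 1.05 * max(bar_peak, curve_peak)

hist(heights, probability = TRUE, ylim = c(0, y_max))
x <- seq(min(heights), max(heights), length.out = 200)
lines(x, dnorm(x, mean = m, sd = s), col = "blue")
```

With the real data, replacing `heights` by fdims$hgt should give the same kind of plot with the whole curve visible.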
Answer: Based on this plot, the data appear to follow an approximately normal distribution.
qqnorm(fdims$hgt)
qqline(fdims$hgt)
A useful way to address this question is to rephrase it as: what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
sim_norm
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
Answer:
qqnorm(sim_norm)
qqline(sim_norm)
The points don’t fall exactly on a line, but they’re very close. The largest deviations come in the upper and lower part of the distribution.
qqnormsim(fdims$hgt)
Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?
Answer: Even though there is some variation at the ends of the female height data, the probability plot for female height is similar to the plots for the simulated data, so the plots do provide evidence that female heights are nearly normal.
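For readers without the lab's helper, the idea behind comparing against several simulated plots can be sketched by hand. The function `qq_sim_grid` below is a hypothetical re-implementation written for this illustration; the lab's own qqnormsim may differ in layout and details. The `toy_heights` vector is simulated, not the lab data.

```r
# Hypothetical sketch: observed Q-Q plot next to Q-Q plots of normal samples
# drawn with the same size, mean, and standard deviation as the data
qq_sim_grid <- function(x, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3))
  on.exit(par(old_par))          # restore the plotting layout afterwards
  qqnorm(x, main = "Observed data")
  qqline(x)
  for (i in seq_len(n_sim)) {
    sim <- rnorm(length(x), mean = mean(x), sd = sd(x))
    qqnorm(sim, main = paste("Simulation", i))
    qqline(sim)
  }
  invisible(NULL)
}

set.seed(42)
toy_heights <- rnorm(260, mean = 165, sd = 6.5)  # stand-in for fdims$hgt
qq_sim_grid(toy_heights)
```

If the observed panel is hard to tell apart from the simulated ones, the data are plausibly nearly normal; a panel that stands out suggests a departure from normality.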
Using the same technique, determine whether or not female weights appear to come from a normal distribution.
Answer:
qqnorm(fdims$wgt)
qqline(fdims$wgt)
The women's weights may not be normal: the curvature in the plot suggests a longer right tail than we'd expect from nearly normal data, and there are two notable outliers.
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
a) What percent of females are shorter than 165 cm?
pnorm(q = 165, mean = fhgtmean, sd = fhgtsd)
## [1] 0.5077833
sum(fdims$hgt < 165) / length(fdims$hgt)
## [1] 0.5076923
b) What percent of females weigh more than 60 kg?
wtmean <- mean(fdims$wgt)
wtsd <- sd(fdims$wgt)
1 - pnorm(q = 60, mean = wtmean, sd = wtsd)
## [1] 0.524893
sum(fdims$wgt > 60) / length(fdims$wgt)
## [1] 0.4384615
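Height had the closer agreement between the two methods, which is consistent with the probability plots: female heights look nearly normal, while female weights show a longer right tail.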
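The two-method comparison can be wrapped in a small function so that the theoretical and empirical calculations always use the same cutoff and the same direction. This is a minimal sketch: `compare_upper_tail` is a hypothetical helper, and both sample vectors are simulated stand-ins (one roughly normal, one right skewed) rather than the lab's height and weight columns.

```r
# Hypothetical helper: P(X > cutoff) from the fitted normal model and from the data
compare_upper_tail <- function(x, cutoff) {
  c(theoretical = 1 - pnorm(cutoff, mean = mean(x), sd = sd(x)),
    empirical   = mean(x > cutoff))  # same as sum(x > cutoff) / length(x)
}

set.seed(7)
sym    <- rnorm(260, mean = 165, sd = 6.5)           # roughly normal sample
skewed <- rlnorm(260, meanlog = 4.1, sdlog = 0.15)   # right-skewed sample

res_sym  <- compare_upper_tail(sym, cutoff = 170)
res_skew <- compare_upper_tail(skewed, cutoff = 60)

print(round(res_sym, 3))
print(round(res_skew, 3))
```

With the lab data, the calls would be compare_upper_tail(fdims$hgt, 182) and compare_upper_tail(fdims$wgt, 60). A lower-tail question needs pnorm without the subtraction from 1, paired with `x < cutoff`.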
**a.** The histogram for female biiliac (pelvic) diameter (`bii.di`) belongs
to normal probability plot letter _B___.
qqnorm(fdims$bii.di)
qqline(fdims$bii.di)
**b.** The histogram for female elbow diameter (`elb.di`) belongs to normal
probability plot letter ___C_.
```r
qqnorm(fdims$elb.di)
qqline(fdims$elb.di)
```
<img src="Qazim-Mulleti-normal_distribution_files/figure-html/unnamed-chunk-8-1.png" width="672" />
**c.** The histogram for general age (`age`) belongs to normal probability
plot letter __D__.
```r
qqnorm(fdims$age)
qqline(fdims$age)
```
<img src="Qazim-Mulleti-normal_distribution_files/figure-html/unnamed-chunk-9-1.png" width="672" />
**d.** The histogram for female chest depth (`che.de`) belongs to normal
probability plot letter _A___.
```r
qqnorm(fdims$che.de)
qqline(fdims$che.de)
```
<img src="Qazim-Mulleti-normal_distribution_files/figure-html/unnamed-chunk-10-1.png" width="672" />
Note that normal probability plots C and D have a slight stepwise pattern.
Why do you think this is the case?
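I think it is because the variables in plots C and D take relatively few distinct values: age was recorded in whole years, and elbow diameter was measured to a limited precision over a narrow range. Many observations therefore share the same value, and stepwise patterns are more prominent in such discrete data.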
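The effect of rounding on a normal probability plot can be reproduced with simulated data. The sketch below is illustrative only: the sample is generated with an assumed mean and standard deviation and is not taken from the lab data.

```r
set.seed(3)
continuous <- rnorm(260, mean = 30, sd = 9)  # simulated values, not the lab data
rounded    <- round(continuous)              # mimic values stored as whole numbers

# Rounding collapses many observations onto the same value
n_distinct_continuous <- length(unique(continuous))
n_distinct_rounded    <- length(unique(rounded))
print(c(continuous = n_distinct_continuous, rounded = n_distinct_rounded))

# Side-by-side normal probability plots
old_par <- par(mfrow = c(1, 2))
qqnorm(continuous, main = "Continuous")
qqline(continuous)
qqnorm(rounded, main = "Rounded")
qqline(rounded)
par(old_par)
```

In the rounded panel, tied values line up horizontally, which produces the staircase appearance.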
As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)
hist(fdims$kne.di, breaks = 20)
The variable appears to be right skewed, which the histogram confirms.
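A numeric check can back up the visual impression. The sketch below computes a moment-based sample skewness, which is positive for right-skewed data, negative for left-skewed data, and near zero for symmetric data. The function `sample_skewness` is a hypothetical helper defined here (base R does not provide one), and the vectors are simulated examples.

```r
# Hypothetical helper: third central moment divided by the 1.5 power
# of the second central moment
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(11)
right_skewed <- rexp(260, rate = 1)   # long right tail
left_skewed  <- -right_skewed         # mirror image, long left tail
symmetric    <- rnorm(260)            # roughly symmetric

print(round(c(right = sample_skewness(right_skewed),
              left  = sample_skewness(left_skewed),
              sym   = sample_skewness(symmetric)), 2))
```

Applied to the lab data, sample_skewness(fdims$kne.di) would be expected to come out positive if the variable is right skewed as the plots suggest.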