library(tidyverse)
library(openintro)
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
set.seed(314159)
Shape: the male heights appear fairly normally distributed, but the female heights are less clearly normal.
Center: the male heights have a median of 177.8 cm, whereas the female heights have a median of 164.5 cm.
Spread: the male heights have a larger range (157.2 to 198.1 cm) than the female heights (147.2 to 182.9 cm).
head(bdims)
##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi che.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2   89.5
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5   97.0
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1   97.5
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5   97.0
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5   97.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8   99.9
##   wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi wri.gi age
## 1   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5   16.5  21
## 2   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5   17.0  23
## 3   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9   16.9  28
## 4   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0   16.6  23
## 5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4   18.0  22
## 6   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5   16.9  21
##    wgt   hgt sex
## 1 65.6 174.0   1
## 2 71.8 175.3   1
## 3 80.7 193.5   1
## 4 72.6 186.5   1
## 5 78.8 187.2   1
## 6 74.8 181.5   1
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
hist(mdims$hgt)
hist(fdims$hgt)
summary(mdims$hgt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   157.2   172.9   177.8   177.7   182.7   198.1
summary(fdims$hgt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   147.2   160.0   164.5   164.9   169.5   182.9
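As a rough check on the spread comparison, the range and interquartile range can be computed from the summary() output above. This is only a sketch: the numbers are copied by hand from that output, and the helper name `spread` is made up for illustration.

```r
# Five-number summaries copied from the summary() output above (heights in cm).
male   <- c(min = 157.2, q1 = 172.9, median = 177.8, q3 = 182.7, max = 198.1)
female <- c(min = 147.2, q1 = 160.0, median = 164.5, q3 = 169.5, max = 182.9)

# Hypothetical helper: range = max - min, IQR = third quartile - first quartile.
spread <- function(s) {
  c(range = unname(s["max"] - s["min"]),
    iqr   = unname(s["q3"] - s["q1"]))
}

spread_m <- spread(male)
spread_f <- spread(female)
print(round(rbind(male = spread_m, female = spread_f), 1))
```

By this calculation the male range is about 5 cm wider (40.9 vs 35.7), while the interquartile ranges are nearly equal (9.8 vs 9.5), so the difference in spread sits mostly in the extremes.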
The data deviate noticeably from the curve in the 155-170 range. It is hard to tell by eye alone whether the data are normally distributed.
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
Not all of the points fall on the line; in fact, many of them fall away from the line in the tails. The Q-Q plot of the simulated data is very similar to the Q-Q plot of the real data.
qqnorm(fdims$hgt)
qqline(fdims$hgt)
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
qqnorm(sim_norm)
qqline(sim_norm)
The actual and simulated plots look very similar. On its own this is not conclusive evidence that the female heights are normally distributed, but it provides a strong frame of reference, and it makes me more confident in saying that the female heights are nearly normally distributed.
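One way to put a number on "look very similar" is to compare the straightness of the observed Q-Q plot with that of many simulated normal samples, instead of a single one. The sketch below is not part of the lab: the helper names `qq_cor` and `qq_cor_reference` are made up, and `hgt` is stand-in data with an assumed sample size and standard deviation, to be replaced by `fdims$hgt`.

```r
set.seed(1)

# Correlation between theoretical and sample quantiles,
# i.e. of the points that qqnorm() would draw.
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)
  cor(q$x, q$y)
}

# Q-Q correlations of n_sim normal samples with the same size,
# mean, and standard deviation as x.
qq_cor_reference <- function(x, n_sim = 200) {
  replicate(n_sim, qq_cor(rnorm(length(x), mean = mean(x), sd = sd(x))))
}

# Stand-in data (assumed values); in the lab this would be fdims$hgt.
hgt <- rnorm(260, mean = 164.9, sd = 6.5)

observed  <- qq_cor(hgt)
reference <- qq_cor_reference(hgt)

# Share of simulated normal samples whose Q-Q plot is no straighter
# than the observed one.
print(mean(reference <= observed))
```

If the observed correlation falls well inside the simulated values, the data are about as straight as a truly normal sample of the same size would be. A value far below all of them would point away from normality.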
When comparing the Q-Q plot of the actual data to that of the simulated data, it is clear that the actual data contain notable outliers and show a visible right-skewed pattern. I believe that the weights are not normally distributed, but we could be more certain by performing a Shapiro-Wilk test.
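A Shapiro-Wilk test is available in base R as `shapiro.test()`. The sketch below runs it on two stand-in samples, because the lab data are not loaded here. The sample sizes and parameters are assumptions, and in the lab the calls would be `shapiro.test(fdims$hgt)` and `shapiro.test(fdims$wgt)`.

```r
set.seed(2)

# Stand-in samples (assumed parameters), one roughly normal and one right-skewed.
roughly_normal <- rnorm(260, mean = 164.9, sd = 6.5)
right_skewed   <- rlnorm(260, meanlog = log(60), sdlog = 0.5)

# shapiro.test() tests the null hypothesis that the sample
# comes from a normal distribution.
sw_normal <- shapiro.test(roughly_normal)
sw_skewed <- shapiro.test(right_skewed)

print(c(W = unname(sw_normal$statistic), p = sw_normal$p.value))
print(c(W = unname(sw_skewed$statistic), p = sw_skewed$p.value))
```

A small p-value is evidence against normality. A large p-value does not prove normality; it only means the test found no clear departure from it.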
hist(fdims$wgt)
qqnorm(fdims$wgt)
qqline(fdims$wgt)
sim_norm_wgt <- rnorm(n = length(fdims$wgt), mean = mean(fdims$wgt), sd = sd(fdims$wgt))
qqnorm(sim_norm_wgt)
qqline(sim_norm_wgt)
What percent of the female population is more than 155 cm tall? (simulated, then actual)
What percent of the female population is less than 70 kg? (simulated, then actual)
There is less discrepancy between the actual and simulated heights (about 0.2 percentage points) than between the actual and simulated weights (about 0.6 percentage points).
1 - pnorm(155, mean = mean(sim_norm), sd = sd(sim_norm))
## [1] 0.9320381
1 - pnorm(155, mean = mean(fdims$hgt), sd = sd(fdims$hgt))
## [1] 0.9342823
pnorm(70, mean = mean(sim_norm_wgt), sd = sd(sim_norm_wgt))
## [1] 0.8301079
pnorm(70, mean = mean(fdims$wgt), sd = sd(fdims$wgt))
## [1] 0.8358461
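The size of each discrepancy can be worked out from the four probabilities printed above. All four come from `pnorm()`, so they are probabilities under a fitted normal curve. A possible complement, sketched below, is the empirical proportion counted directly from the data. The helpers `prop_above` and `prop_below` are hypothetical names, and `hgt` is stand-in data with assumed parameters.

```r
# Probabilities copied from the output above.
hgt_sim <- 0.9320381
hgt_act <- 0.9342823
wgt_sim <- 0.8301079
wgt_act <- 0.8358461

hgt_gap <- abs(hgt_sim - hgt_act)  # about 0.0022
wgt_gap <- abs(wgt_sim - wgt_act)  # about 0.0057
print(c(height = hgt_gap, weight = wgt_gap))

# Hypothetical helpers: share of observations beyond a cutoff,
# counted directly rather than read off a fitted normal curve.
prop_above <- function(x, cutoff) mean(x > cutoff)
prop_below <- function(x, cutoff) mean(x < cutoff)

# Stand-in data (assumed values); in the lab: prop_above(fdims$hgt, 155)
# and prop_below(fdims$wgt, 70).
set.seed(3)
hgt <- rnorm(260, mean = 164.9, sd = 6.5)
print(prop_above(hgt, 155))
```

If the empirical proportion is close to the `pnorm()` value, the normal model describes that part of the distribution well. A larger gap suggests the normal approximation is weaker there.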
The biiliac diameter histogram belongs to Q-Q plot B because of the left tail.
The elbow diameter histogram belongs to Q-Q plot C because it is not one of the right-skewed plots.
The general age histogram belongs to Q-Q plot D, because it is the only one left that has a maximum less than +4 SD for sample quantiles.
The chest depth histogram corresponds to Q-Q plot A (only one left).
The stepwise pattern in these Q-Q plots is due to granularity in measurement. That makes sense for age, which is recorded in whole years, and for chest depth it likely has something to do with the measurement method or instrument used.
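The effect of granularity can be illustrated with a small simulation, where rounding a continuous variable creates ties and each run of tied values shows up as a flat step in the Q-Q plot. This is only a sketch on stand-in data, with the mean, standard deviation, and sample size chosen arbitrarily.

```r
set.seed(4)

# Stand-in data (assumed values): a continuous variable and the same
# variable recorded to the nearest whole unit, as ages are in whole years.
exact   <- rnorm(260, mean = 30, sd = 10)
rounded <- round(exact)

# Rounding collapses many observations onto the same value.
n_distinct_exact   <- length(unique(exact))
n_distinct_rounded <- length(unique(rounded))
print(c(exact = n_distinct_exact, rounded = n_distinct_rounded))

# Each group of tied sample quantiles is a horizontal step in the Q-Q plot.
qq <- qqnorm(rounded, plot.it = FALSE)
largest_tie <- max(table(qq$y))
print(largest_tie)
```

Calling `qqnorm(rounded)` and `qqnorm(exact)` side by side should show the stepped and the smooth version of the same data.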
The data for female knee diameter is skewed right.
hist(fdims$kne.di)
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)
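The visual impression of right skew can be backed by a numeric summary. Base R has no built-in skewness function, so the sketch below defines a simple moment-based one. The name `skewness` and this particular formula are choices made here, not something the lab prescribes.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

print(skewness(c(1, 2, 3)))         # symmetric sample, skewness 0
print(skewness(c(1, 2, 3, 4, 10)))  # long right tail, positive skewness
```

In the lab this would be called as `skewness(fdims$kne.di)`. A clearly positive value, together with a mean above the median, would be consistent with the right skew seen in the histogram and Q-Q plot.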