download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
head(bdims)
##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)

Exercise 1: Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

mhgt <- mdims$hgt
fhgt <- fdims$hgt

par(mfrow = c(1, 2))
hist(mhgt, main = "Male Heights", xlim = c(155, 200), ylim = c(0, 80))
hist(fhgt, main = "Female Heights", xlim = c(155, 200), ylim = c(0, 80))

The distribution of male heights looks closer to normal, whereas the distribution of female heights looks somewhat right-skewed.
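A visual comparison like this can be backed up with a few numbers. The sketch below is one possible way to do that; `shape_summary` is a hypothetical helper (not part of the lab), and the data it is run on are simulated stand-ins, not the bdims measurements.

```r
# Hypothetical helper: numeric summaries to accompany a histogram.
shape_summary <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  dev <- x - m
  # Moment-based sample skewness: values above 0 suggest a longer right tail.
  skew <- mean(dev^3) / mean(dev^2)^1.5
  c(mean = m, median = median(x), sd = sd(x), skewness = skew)
}

# Stand-in data (simulated, NOT the bdims measurements).
set.seed(1)
demo_heights <- rnorm(260, mean = 165, sd = 6.5)
round(shape_summary(demo_heights), 2)
```

In the lab itself, the same call could be applied to `mhgt` and `fhgt` to compare center, spread, and skew side by side.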

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)
hist(fdims$hgt, probability = TRUE, ylim=c(0,0.06))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue", lwd=2)

Exercise 2: Based on this plot, does it appear that the data follow a nearly normal distribution?

Although some of the bars spill outside of the curve, the distribution as a whole does seem to follow a nearly normal distribution.
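The overlay above uses a hard-coded x range and y limit. As a rough sketch, the same plot can be built so that both are taken from the data; `hist_with_normal` is a hypothetical helper name, and the demo vector is simulated rather than taken from bdims.

```r
# Hypothetical helper: density histogram with a fitted normal curve on top.
hist_with_normal <- function(x, n_grid = 200, main = "Histogram with normal curve") {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  # Evaluate the normal density on a fine grid spanning the observed data.
  grid <- seq(min(x), max(x), length.out = n_grid)
  dens <- dnorm(grid, mean = m, sd = s)
  # Compute bar heights first so the y axis fits both bars and curve.
  h <- hist(x, plot = FALSE)
  hist(x, probability = TRUE, main = main,
       ylim = c(0, max(h$density, dens)))
  lines(grid, dens, col = "blue", lwd = 2)
  invisible(list(mean = m, sd = s, grid = grid, density = dens))
}

# Stand-in data (simulated, NOT the bdims measurements).
set.seed(1)
hist_with_normal(rnorm(260, mean = 165, sd = 6.5))
```

In the lab, `hist_with_normal(fdims$hgt)` would be the corresponding call.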

qqnorm(fdims$hgt)
qqline(fdims$hgt)

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

Exercise 3: Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

qqnorm(sim_norm)
qqline(sim_norm)

The simulated data align with the line much better; however, some of the points in the tails still deviate.

qqnormsim(fdims$hgt)

The plot for fdims$hgt shows more deviation along its middle portion, but otherwise the plots look much alike: the points align most closely in the center and deviate mostly in the tails. These plots therefore provide evidence that female heights are nearly normal.
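`qqnormsim` is not defined anywhere in this report. The sketch below is only a guess at how a function of that kind could be written: it draws the QQ plot of the observed data next to eight QQ plots of samples simulated from a normal distribution with the same mean and standard deviation. The name `qq_sim_grid` and the 3-by-3 layout are assumptions for illustration.

```r
# Hypothetical stand-in for a qqnormsim-style function (an assumption,
# not the lab's actual implementation).
qq_sim_grid <- function(x, n_sims = 8) {
  x <- x[!is.na(x)]
  # 3 x 3 grid: one panel for the data plus eight simulations.
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))
  qqnorm(x, main = "Observed data")
  qqline(x)
  # Each simulated sample matches the data in size, mean, and SD.
  sims <- replicate(n_sims, rnorm(length(x), mean(x), sd(x)),
                    simplify = FALSE)
  for (i in seq_along(sims)) {
    qqnorm(sims[[i]], main = paste("Simulation", i))
    qqline(sims[[i]])
  }
  invisible(sims)
}

# Stand-in data (simulated, NOT the bdims measurements).
set.seed(2)
qq_sim_grid(rnorm(260, mean = 165, sd = 6.5))
```

If the observed panel is hard to pick out from the simulated ones, that supports treating the data as nearly normal.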

Exercise 4: Using the same technique, determine whether or not female weights appear to come from a normal distribution.

fwgtmean <- mean(fdims$wgt)
fwgtsd   <- sd(fdims$wgt)
fdims$wgt
##   [1]  51.6  59.0  49.2  63.0  53.6  59.0  47.6  69.8  66.8  75.2  55.2
##  [12]  54.2  62.5  42.0  50.0  49.8  49.2  73.2  47.8  68.8  50.6  82.5
##  [23]  57.2  87.8  72.8  54.5  59.8  67.3  67.8  47.0  46.2  55.0  83.0
##  [34]  54.4  45.8  53.6  73.2  52.1  67.9  56.6  62.3  58.5  54.5  50.2
##  [45]  60.3  58.3  56.2  50.2  72.9  59.8  61.0  69.1  55.9  46.5  54.3
##  [56]  54.8  60.7  60.0  62.0  60.3  52.7  74.3  62.0  73.1  80.0  54.7
##  [67]  53.2  75.7  61.1  55.7  48.7  52.3  50.0  59.3  62.5  55.7  54.8
##  [78]  45.9  70.6  67.2  69.4  58.2  64.8  71.6  52.8  59.8  49.0  50.0
##  [89]  69.2  55.9  63.4  58.2  58.6  45.7  52.2  48.6  57.8  55.6  66.8
## [100]  59.4  53.6  73.2  53.4  69.0  58.4  56.2  70.6  59.8  72.0  65.2
## [111]  56.6 105.2  51.8  63.4  59.0  47.6  63.0  55.2  45.0  54.0  50.2
## [122]  60.2  44.8  58.8  56.4  62.0  49.2  67.2  53.8  54.4  58.0  59.8
## [133]  54.8  43.2  60.5  46.4  64.4  48.8  62.2  55.5  57.8  54.6  59.2
## [144]  52.7  53.2  64.5  51.8  56.0  63.6  63.2  59.5  56.8  64.1  50.0
## [155]  72.3  55.0  55.9  60.4  69.1  84.5  55.9  55.5  69.5  76.4  61.4
## [166]  65.9  58.6  66.8  56.6  58.6  55.9  59.1  81.8  70.7  56.8  60.0
## [177]  58.2  72.7  54.1  49.1  75.9  55.0  57.3  55.0  65.5  65.5  48.6
## [188]  58.6  63.6  55.2  62.7  56.6  53.9  63.2  73.6  62.0  63.6  53.2
## [199]  53.4  55.0  70.5  54.5  54.5  55.9  59.0  63.6  54.5  47.3  67.7
## [210]  80.9  70.5  60.9  63.6  54.5  59.1  70.5  52.7  62.7  86.3  66.4
## [221]  67.3  63.0  73.6  62.3  57.7  55.4 104.1  55.5  77.3  80.5  64.5
## [232]  72.3  61.4  58.2  81.8  63.6  53.4  54.5  53.6  60.0  73.6  61.4
## [243]  55.5  63.6  60.9  60.0  46.8  57.3  64.1  63.6  67.3  75.5  68.2
## [254]  61.4  76.8  71.8  55.5  48.6  66.4  67.3
hist(fdims$wgt, probability = TRUE)
x <- 40:106
y <- dnorm(x = x, mean = fwgtmean, sd = fwgtsd)
lines(x = x, y = y, col = "red", lwd=2)

qqnorm(fdims$wgt)
qqline(fdims$wgt)

sim_normW <- rnorm(n = length(fdims$wgt), mean = fwgtmean, sd = fwgtsd)
qqnorm(sim_normW)
qqline(sim_normW)

qqnormsim(fdims$wgt)

With this technique, female weights do not appear to come from a normal distribution. The histogram indicates right skew, and the QQ plot of fdims$wgt shows far more deviation from the line than the simulated normal QQ plots do.

1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
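The upper-tail probability above is written as `1 - pnorm(...)`. As a small aside, `pnorm` can also return the upper tail directly through its `lower.tail` argument. The mean and standard deviation below are made-up illustration values, not the estimates from bdims.

```r
# Hypothetical mean and SD, for illustration only (not the bdims estimates).
demo_mean <- 165
demo_sd <- 6.5

# Two equivalent ways to get P(X > 182) under a normal model.
upper_a <- 1 - pnorm(q = 182, mean = demo_mean, sd = demo_sd)
upper_b <- pnorm(q = 182, mean = demo_mean, sd = demo_sd, lower.tail = FALSE)

all.equal(upper_a, upper_b)
```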

Exercise 6: Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

1. What is the probability that a randomly chosen female is between 5’2" (about 157.5 cm) and 5’8" (about 172.7 cm)?

pnorm(q = 172.7, mean = fhgtmean, sd = fhgtsd) - pnorm(q = 157.5, mean = fhgtmean, sd = fhgtsd)
## [1] 0.7541791
sum(fdims$hgt >= 157.5 & fdims$hgt < 172.7) / length(fdims$hgt)
## [1] 0.7423077

2. What is the probability that a randomly chosen female weighs less than 130 lbs (about 58.97 kg)?

pnorm(q=58.97, mean=fwgtmean, sd=fwgtsd)
## [1] 0.4326803
sum(fdims$wgt < 58.97) / length(fdims$wgt)

The height variable had closer agreement between the two methods: its theoretical and empirical probabilities differ by only about 0.012 (0.754 vs. 0.742).
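Comparisons of this kind are easy to get wrong by hand, for example through a misplaced parenthesis in the empirical proportion. One way to reduce that risk is to wrap both calculations in a single function. The sketch below is an illustration only; `compare_prob` is a hypothetical helper, and using `mean()` on a logical vector is simply a compact way to compute a proportion.

```r
# Hypothetical helper: P(lower <= X < upper) under a fitted normal model
# and as an empirical proportion, plus their absolute difference.
compare_prob <- function(x, lower = -Inf, upper = Inf) {
  x <- x[!is.na(x)]
  theoretical <- pnorm(upper, mean(x), sd(x)) - pnorm(lower, mean(x), sd(x))
  # mean() of a logical vector is the share of TRUE values.
  empirical <- mean(x >= lower & x < upper)
  c(theoretical = theoretical,
    empirical = empirical,
    abs_diff = abs(theoretical - empirical))
}

# Stand-in data (simulated, NOT the bdims measurements).
set.seed(3)
demo_weights <- rnorm(260, mean = 60, sd = 9.5)
compare_prob(demo_weights, upper = 58.97)
```

In the lab, `compare_prob(fdims$hgt, 157.5, 172.7)` and `compare_prob(fdims$wgt, upper = 58.97)` would give the four probabilities and the two differences in one step.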

On Your Own

1a. B

1b. C

1c. D

1d. A

2. This is probably because there are many repeated values, especially in Histogram C, where some values have a frequency of 150.

3.

qqnorm(fdims$kne.di)
qqline(fdims$kne.di)

Based on the normal probability plot, female knee diameter appears to have a right-skewed distribution.

hist(fdims$kne.di)