download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 1
## 2 17.0 23 71.8 175.3 1
## 3 16.9 28 80.7 193.5 1
## 4 16.6 23 72.6 186.5 1
## 5 18.0 22 78.8 187.2 1
## 6 16.9 21 74.8 181.5 1
The data set contains body measurements from 247 healthy men and 260 healthy women.
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
Two additional data sets were created, one for men and one for women, because the two groups have different body dimensions.
Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
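One way to make the two histograms comparable is to draw them on the same bins and the same x-axis. The helper below is a minimal sketch, not part of the lab: the name `compare_hists` and the 5 cm bin width are my own assumptions.

```r
# Hypothetical helper (not part of the lab): two histograms on shared bins.
compare_hists <- function(men, women, bin_width = 5) {
  rng <- range(c(men, women))
  # Common break points so both panels use identical bins and x-axis.
  breaks <- seq(floor(rng[1] / bin_width) * bin_width,
                ceiling(rng[2] / bin_width) * bin_width,
                by = bin_width)
  old_par <- par(mfrow = c(2, 1))  # stack the two panels vertically
  on.exit(par(old_par))            # restore the previous layout on exit
  h_men <- hist(men, breaks = breaks,
                xlab = "Height (cm)", main = "Men's heights")
  h_women <- hist(women, breaks = breaks,
                  xlab = "Height (cm)", main = "Women's heights")
  invisible(list(men = h_men, women = h_women))
}

# In this lab the call would be: compare_hists(mdims$hgt, fdims$hgt)
```

With shared bins, differences in center, spread, and shape can be read directly off the two panels.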
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
The mean and standard deviation are the only two parameters needed to specify a normal distribution. Here they are computed for the women’s heights only.
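To see why two numbers are enough, the normal density can be written out by hand and checked against `dnorm`. This is a small sketch; `normal_density` is a name I made up, and the values of `mu` and `sigma` are illustrative placeholders rather than the values computed from `fdims`.

```r
# Normal density written out from its formula; mu and sigma are the only parameters.
normal_density <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

mu <- 165     # illustrative mean (assumption, not taken from the data)
sigma <- 6.5  # illustrative standard deviation (assumption)
x <- 140:190

# Largest absolute difference between the hand-written formula and dnorm().
max(abs(normal_density(x, mu, sigma) - dnorm(x, mean = mu, sd = sigma)))
```

In the lab itself, `fhgtmean` and `fhgtsd` would take the place of `mu` and `sigma`.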
hist(fdims$hgt, probability = TRUE, col = "gray", ylim = c(0, 0.06), xlim = c(140, 190))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
The density histogram serves as a backdrop, and lines() overlays the normal probability curve. The x- and y-axis limits were adjusted with xlim and ylim.
Based on this plot, does it appear that the data follow a nearly normal distribution?
qqnorm(fdims$hgt)
qqline(fdims$hgt)
The Q-Q (quantile-quantile) plot shows how close the raw data are to being normally distributed. Deviations from normality show up as deviations of the points from the line.
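A Q-Q plot can also be built by hand, which makes it clearer what `qqnorm` and `qqline` are drawing. The sketch below uses a simulated stand-in sample, since the values are placeholders and not the lab data.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
y <- rnorm(50, mean = 165, sd = 6.5)  # stand-in sample (assumption, not fdims$hgt)
n <- length(y)

theoretical <- qnorm(ppoints(n))  # quantiles expected from a standard normal
sample_q <- sort(y)               # observed quantiles: the sorted data

plot(theoretical, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "Q-Q plot built by hand")

# Reference line through the first and third quartiles, the same idea as qqline().
slope <- unname(diff(quantile(y, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(y, 0.25)) - slope * qnorm(0.25)
abline(a = intercept, b = slope)
```

If the data are close to normal, the sorted values grow roughly linearly with the theoretical quantiles, so the points track the line.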
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
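This is a data set simulated from a normal distribution with the same mean, standard deviation, and number of observations as the female heights. It gives a baseline for how far a Q-Q plot of truly normal data strays from the line.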
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
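A minimal sketch of this step follows. To keep it self-contained, it simulates its own sample from placeholder values; in the lab, `sim_norm` from above would be plotted directly.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
n_obs <- 260     # placeholder for length(fdims$hgt)
mean_hgt <- 165  # placeholder for fhgtmean (assumption)
sd_hgt <- 6.5    # placeholder for fhgtsd (assumption)

sim_norm <- rnorm(n = n_obs, mean = mean_hgt, sd = sd_hgt)

qq <- qqnorm(sim_norm, main = "Normal Q-Q plot (simulated)")  # returns the plotted x and y
qqline(sim_norm)
```

Even for data that really are normal, the points do not all fall exactly on the line, and the ends of the plot tend to wander the most.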
qqnormsim(fdims$hgt)
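qqnormsim draws the Q-Q plot of the original data alongside Q-Q plots of several data sets simulated from a normal distribution, so the real data can be compared with many simulations instead of just one.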
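To make the idea concrete, here is a rough sketch of what a function like this could do. It is not the lab's actual `qqnormsim` code; the name `qq_sim_grid`, the 3-by-3 layout, and the default of eight simulations are my assumptions.

```r
# Hypothetical stand-in for qqnormsim (not the lab's implementation).
qq_sim_grid <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3))  # 3 x 3 grid: one data panel plus eight simulations
  on.exit(par(old_par))            # restore the previous layout on exit

  qqnorm(dat, main = "Normal Q-Q plot (data)")
  qqline(dat)

  # Simulate samples with the same size, mean, and sd as the data.
  sims <- replicate(n_sim,
                    rnorm(length(dat), mean = mean(dat), sd = sd(dat)),
                    simplify = FALSE)
  for (s in sims) {
    qqnorm(s, main = "Normal Q-Q plot (sim)")
    qqline(s)
  }
  invisible(sims)
}

# In this lab the call would be: qq_sim_grid(fdims$hgt)
```

If the data panel is hard to tell apart from the simulated panels, that is evidence that the data are nearly normal.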
Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?
Yes, the original data set looks similar to the simulated data. There is evidence that the female heights are nearly normal.
Using the same technique, determine whether or not female weights appear to come from a normal distribution.
qqnormsim(fdims$wgt)
What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
We know that pnorm gives us the area under the normal curve below a given number (or to the left). We must take 1 minus that in order to find the area above that number.
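The same upper-tail area can be reached in a few equivalent ways. In the sketch below, `mu` and `sigma` are illustrative placeholders, not the values computed from `fdims`.

```r
mu <- 165     # illustrative mean (assumption)
sigma <- 6.5  # illustrative standard deviation (assumption)
q <- 182      # cutoff in cm

# 1. One minus the area to the left of q.
upper_by_complement <- 1 - pnorm(q, mean = mu, sd = sigma)

# 2. Ask pnorm for the upper tail directly.
upper_direct <- pnorm(q, mean = mu, sd = sigma, lower.tail = FALSE)

# 3. Standardize to a z-score and use the standard normal.
z <- (q - mu) / sigma
upper_standard <- pnorm(z, lower.tail = FALSE)

c(upper_by_complement, upper_direct, upper_standard)
```

All three give the same area; `lower.tail = FALSE` is just a more direct way of writing the subtraction.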
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
This calculates the probability empirically, as the proportion of women in the sample who are taller than 182 cm.
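The empirical calculation works because a comparison such as `x > 182` returns TRUE/FALSE values, which R counts as 1 and 0. A tiny sketch with made-up heights (not the lab data):

```r
heights <- c(158, 163, 171, 184, 166, 175, 160, 169)  # made-up values, for illustration only

taller <- heights > 182  # logical vector: TRUE where the height exceeds the cutoff

prop_by_sum <- sum(taller) / length(taller)  # count of TRUEs divided by sample size
prop_by_mean <- mean(taller)                 # the mean of a logical vector is the same proportion
```

Here one of the eight values exceeds 182, so both expressions give 1/8.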
Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
1.a. What’s the probability that a randomly chosen young adult male weighs more than 100 kg?
mwgtmean <- mean(mdims$wgt)
mwgtsd <- sd(mdims$wgt)
1 - pnorm(q = 100, mean = mwgtmean, sd = mwgtsd)
## [1] 0.01881231
b. What’s the empirical probability?
sum(mdims$wgt > 100) / length(mdims$wgt)
## [1] 0.02834008
2.a. What’s the probability that a randomly chosen young adult female weighs more than 80 kg?
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
1 - pnorm(q = 80, mean = fwgtmean, sd = fwgtsd)
## [1] 0.02182199
b. What’s the empirical probability?
sum(fdims$wgt > 80) / length(fdims$wgt)
## [1] 0.04230769
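In both weight questions the empirical probability is larger than the one from the normal model. I think this is because the weights are skewed to the right, so the sample has more values in the upper tail than a normal distribution predicts.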
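The effect of right skew on an upper-tail probability can be checked without any sampling. The sketch below uses a lognormal distribution purely as a convenient right-skewed example; the distribution and its parameters are my assumptions and are not fitted to the weight data.

```r
meanlog <- 0   # illustrative lognormal parameters (assumption)
sdlog <- 0.5

# Mean and standard deviation of this lognormal distribution.
m <- exp(meanlog + sdlog^2 / 2)
s <- sqrt((exp(sdlog^2) - 1) * exp(2 * meanlog + sdlog^2))

cutoff <- m + 2 * s  # a cutoff two standard deviations above the mean

# True upper-tail probability under the skewed distribution.
true_tail <- plnorm(cutoff, meanlog = meanlog, sdlog = sdlog, lower.tail = FALSE)

# Upper-tail probability from a normal model with the same mean and sd.
normal_tail <- pnorm(cutoff, mean = m, sd = s, lower.tail = FALSE)

true_tail > normal_tail
```

For this example the normal model understates the upper tail, the same direction of disagreement seen in the weight calculations above.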
fkneemean <- mean(fdims$kne.di)
fkneesd <- sd(fdims$kne.di)
hist(fdims$kne.di, probability = TRUE, col = "yellow", ylim = c(0, 0.4), xlim = c(14, 26))