load("more/bdims.RData")
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
1. Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
hist(mdims$hgt, xlab = "Men's Height in cm", col = "blue")
hist(fdims$hgt, xlab = "Women's Height in cm", col = "pink")
The men’s height distribution is more symmetric than the women’s, and the women’s is more spread out than the men’s. Both distributions are unimodal and bell-shaped.
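A visual comparison like this is easier to defend with a few numbers and a shared axis. The sketch below is only an illustration: it uses simulated stand-in heights (the means, standard deviations, and sample sizes are assumptions, not values from bdims), and summarise_hgt is a hypothetical helper name. With the real data, mdims$hgt and fdims$hgt would take the place of the simulated vectors.

```r
# Stand-in data: simulated heights in cm. The means, SDs, and sample sizes
# are assumptions for illustration, not values taken from bdims.
set.seed(1)
men_hgt   <- rnorm(247, mean = 178, sd = 7)
women_hgt <- rnorm(260, mean = 165, sd = 6.5)

# Center, spread, and a rough symmetry check (mean minus median) for one group.
summarise_hgt <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x),
    mean_minus_median = mean(x) - median(x))
}
height_summary <- rbind(men = summarise_hgt(men_hgt),
                        women = summarise_hgt(women_hgt))
print(round(height_summary, 2))

# Shared breaks put both histograms on the same bins and x-axis.
brks <- pretty(range(c(men_hgt, women_hgt)), n = 20)
old_par <- par(mfrow = c(2, 1))
hist(men_hgt, breaks = brks, xlab = "Men's Height in cm", main = "Men", col = "blue")
hist(women_hgt, breaks = brks, xlab = "Women's Height in cm", main = "Women", col = "pink")
par(old_par)
```

A larger standard deviation or IQR indicates more spread, and a mean close to the median is consistent with a symmetric shape.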
2. Based on this plot, does it appear that the data follow a nearly normal distribution?
Based on the plot, the women’s height distribution is approximately normal. More analysis is needed to be certain, though.
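One way to judge "nearly normal" from a histogram is to draw the normal density with the same mean and standard deviation on top of it. The following is a minimal sketch under assumptions: hgt is simulated stand-in data rather than fdims$hgt, and the variable names are illustrative.

```r
# Stand-in for fdims$hgt: simulated heights in cm (assumed mean and SD).
set.seed(2)
hgt <- rnorm(260, mean = 165, sd = 6.5)
hgt_mean <- mean(hgt)
hgt_sd <- sd(hgt)

# Normal density evaluated over the observed range.
x <- seq(min(hgt), max(hgt), length.out = 200)
y <- dnorm(x, mean = hgt_mean, sd = hgt_sd)

# Compute the bins first so the y-axis can fit both the bars and the curve.
h <- hist(hgt, plot = FALSE)
hist(hgt, freq = FALSE, ylim = c(0, max(h$density, y)),
     xlab = "Height in cm", main = "Density histogram with normal curve")
lines(x, y, col = "blue")
```

The histogram must be on the density scale (freq = FALSE) for the curve and the bars to be comparable.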
3. Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
hist(sim_norm)
qqnorm(sim_norm)
qqline(sim_norm)
Not all of the points fall on the line. The plot resembles the one for the real data: the points follow the line closely in the middle and deviate somewhat at the tail ends.
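A single simulation shows only one possible outcome, so it can help to look at several simulated samples side by side to see how much the tails wander by chance alone. The sketch below assumes simulated stand-in data in place of fdims$hgt; the grid layout and the number of simulations (8) are arbitrary choices.

```r
# Stand-in for fdims$hgt: simulated heights in cm (assumed mean and SD).
set.seed(3)
hgt <- rnorm(260, mean = 165, sd = 6.5)

# Eight simulated normal samples with the same size, mean, and SD as hgt;
# each column of sims is one simulated sample.
sims <- replicate(8, rnorm(length(hgt), mean = mean(hgt), sd = sd(hgt)))

# 3 x 3 grid: the observed data first, then the eight simulations.
old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
qqnorm(hgt, main = "Observed (stand-in)")
qqline(hgt)
for (i in seq_len(ncol(sims))) {
  qqnorm(sims[, i], main = paste("Simulation", i))
  qqline(sims[, i])
}
par(old_par)
```

If the observed panel cannot be told apart from the simulated ones, the departures in its tails are within what sampling variation produces.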
4. Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?
Yes. The simulated data points follow the normal line closely in the middle and deviate at the tail ends, which is the same pattern seen in the women’s heights. This provides evidence that the female heights are nearly normal.
5. Using the same technique, determine whether or not female weights appear to come from a normal distribution.
hist(fdims$wgt)
qqnorm(fdims$wgt)
qqline(fdims$wgt)
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
sim_norm <- rnorm(n = length(fdims$wgt), mean = fwgtmean, sd = fwgtsd)
qqnorm(sim_norm)
qqline(sim_norm)
The distribution of female weights appears heavily right skewed, with extreme values on the right side. Too many points fall away from the normal line, so the distribution cannot be well approximated by a normal distribution.
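The visual impression of skew can be backed with a number. The sketch below computes the moment-based sample skewness by hand; sample_skewness is a hypothetical helper, and both vectors are simulated stand-ins (a normal sample and a log-normal sample with assumed parameters), not the bdims weights.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  dev <- x - mean(x)
  mean(dev^3) / mean(dev^2)^1.5
}

# Stand-in data with assumed parameters: one symmetric sample and one
# right-skewed sample.
set.seed(4)
sym_x <- rnorm(260, mean = 60, sd = 9)
skew_x <- rlnorm(260, meanlog = log(60), sdlog = 0.3)

skew_sym <- sample_skewness(sym_x)
skew_right <- sample_skewness(skew_x)
print(c(symmetric = skew_sym, right_skewed = skew_right))
```

Values near zero are consistent with symmetry, while clearly positive values indicate a longer right tail. Applied to the real data, the call would be sample_skewness(fdims$wgt).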
6. Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
Question 1 - What is the probability that a randomly selected female has a height between 125 cm and 175 cm?
pnorm(175, fhgtmean, fhgtsd) - pnorm(125, fhgtmean, fhgtsd)
## [1] 0.9391272
sum(fdims$hgt > 125 & fdims$hgt < 175) / length(fdims$hgt)
## [1] 0.9153846
The results are reasonably close. This is because the women’s heights can be approximated by a normal distribution, particularly in the center portion of the data.
Question 2 - What is the probability that a randomly selected female has a weight greater than 80 kg?
pnorm(80, fwgtmean, fwgtsd, lower.tail = FALSE)
## [1] 0.02182199
sum(fdims$wgt > 80) / length(fdims$wgt)
## [1] 0.04230769
The results differ substantially: the empirical probability (about 0.042) is nearly twice the theoretical one (about 0.022). This is because the women’s weights are far from normally distributed.
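Both questions follow the same pattern, so the two calculations can be wrapped in one small function. This is a sketch under assumptions: compare_prob is a hypothetical helper, and the data are simulated stand-ins with assumed parameters rather than fdims$hgt and fdims$wgt.

```r
# Probability that a value lies strictly between lower and upper, computed
# from a fitted normal distribution and from the sample proportion.
compare_prob <- function(x, lower = -Inf, upper = Inf) {
  theoretical <- pnorm(upper, mean(x), sd(x)) - pnorm(lower, mean(x), sd(x))
  empirical <- mean(x > lower & x < upper)
  c(theoretical = theoretical, empirical = empirical,
    difference = theoretical - empirical)
}

# Stand-in data with assumed parameters.
set.seed(5)
hgt <- rnorm(260, mean = 165, sd = 6.5)
wgt <- rlnorm(260, meanlog = log(60), sdlog = 0.3)

res_hgt <- compare_prob(hgt, lower = 155, upper = 175)
res_wgt <- compare_prob(wgt, lower = 80)
print(round(rbind(height = res_hgt, weight = res_wgt), 4))
```

The difference column gives a direct measure of how well the two methods agree for each variable.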
Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.
b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.
c. The histogram for general age (age) belongs to normal probability plot letter D.
d. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.
Note that normal probability plots C and D have a slight stepwise pattern.
Why do you think this is the case?
The stepwise pattern arises because the variable is discrete rather than continuous. For a continuous variable, the observations form a line or curve. For a discrete variable, the observations typically jump from one discrete value to the next. Plot C corresponds to elbow diameter and Plot D corresponds to age. Age is definitely measured as a discrete variable. Elbow diameter is also somewhat discrete, given the limited range of values the measurement takes.
length(fdims$age)/length(unique(fdims$age))
## [1] 7.027027
length(fdims$elb.di)/length(unique(fdims$elb.di))
## [1] 6.190476
length(fdims$che.de)/length(unique(fdims$che.de))
## [1] 3.880597
length(fdims$bii.di)/length(unique(fdims$bii.di))
## [1] 3.421053
Here, to compare how “discrete” each variable is, I divide the number of observations by the number of unique values of that variable. If a variable is more continuous, the sample will have more unique values, so the calculated ratio will be lower. If a variable is discrete, it will have comparatively fewer unique values, so the ratio will be higher. Comparing the ratios, it is apparent that age and elbow diameter are more discrete than the other two variables.
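The link between this ratio and the stepwise pattern can be checked on simulated data by rounding a continuous sample. This is only a sketch: obs_per_unique is a hypothetical helper wrapping the ratio used above, and the sample is simulated with assumed parameters rather than taken from bdims.

```r
# Observations per unique value: 1 means every value is distinct,
# larger values mean more ties (a more discrete variable).
obs_per_unique <- function(x) length(x) / length(unique(x))

# Stand-in continuous sample (assumed parameters) and a rounded copy.
set.seed(6)
raw <- rnorm(260, mean = 34, sd = 11)
rounded <- round(raw)

ratio_raw <- obs_per_unique(raw)
ratio_rounded <- obs_per_unique(rounded)
print(c(raw = ratio_raw, rounded = ratio_rounded))

# Rounding creates ties, which show up as horizontal steps in the plot.
old_par <- par(mfrow = c(1, 2))
qqnorm(raw, main = "Unrounded")
qqline(raw)
qqnorm(rounded, main = "Rounded to whole numbers")
qqline(rounded)
par(old_par)
```

The rounded sample has the same overall shape, but its tied values line up horizontally, which is the pattern seen in plots C and D.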
Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)
The variable is right skewed. Most points fall on the low end with fewer points on the high end.
hist(fdims$kne.di)
As you can see, the distribution is right skewed.