Loading Data and Packages

load("more/bdims.RData")
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)

Exercise

1. Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

hist(mdims$hgt, xlab = "Men's Height in cm", col = "blue")

hist(fdims$hgt, xlab = "Women's Height in cm", col = "pink")

The men’s height distribution is more symmetric than the women’s, and the women’s is more spread out. Both distributions are unimodal and bell-shaped.
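
A numeric summary can back up the visual comparison. The sketch below is an illustration, not part of the original lab: `describe_shape` is a hypothetical helper, and the two vectors are simulated stand-ins with arbitrary parameters so the block runs without `bdims`. In practice one would pass `mdims$hgt` and `fdims$hgt` instead.

```r
# Hypothetical helper: centre and spread of a numeric vector.
describe_shape <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

# Simulated stand-ins with arbitrary parameters (assumption);
# replace with mdims$hgt and fdims$hgt when the data are loaded.
set.seed(1)
group_a <- rnorm(250, mean = 170, sd = 6)
group_b <- rnorm(260, mean = 160, sd = 8)

# One row per group, so centre and spread can be read side by side.
shape_table <- rbind(group_a = describe_shape(group_a),
                     group_b = describe_shape(group_b))
print(round(shape_table, 2))
```

A mean close to the median suggests symmetry, while a larger `sd` or `iqr` indicates the more spread-out group.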

2. Based on this plot, does it appear that the data follow a nearly normal distribution?

Based on the plot, the women’s height distribution can be approximated reasonably well by a normal distribution, although more analysis is needed to be certain.

3. Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
hist(sim_norm)

qqnorm(sim_norm)
qqline(sim_norm)

Not all the points fall on the line. This plot is similar to the real data, where the data closely follow the normal distribution in the middle, and deviate somewhat at the tail ends.
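
Because even truly normal samples stray from the line, it helps to see how much straying is typical. The sketch below is one possible way to quantify that, not the method used in the lab: `qq_cor` is a hypothetical helper that returns the correlation between sample and theoretical quantiles, and the sample size of 260 is an assumption (in practice use `length(fdims$hgt)`).

```r
# Hypothetical helper: correlation between sample and theoretical quantiles.
# qqnorm(plot.it = FALSE) returns list(x = theoretical, y = sample) without plotting.
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

set.seed(42)
n <- 260  # assumed sample size; use length(fdims$hgt) with the real data

# Q-Q correlations for 200 samples that really are normal.
sim_cors <- replicate(200, qq_cor(rnorm(n)))
print(range(sim_cors))
```

Calling `qq_cor(fdims$hgt)` and checking whether the result falls inside the simulated range gives a rough sense of whether the real data deviate more than chance alone would explain.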

4. Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?

Yes. The simulated data points follow the normal line closely in the middle and deviate at the tail ends, and the women’s heights show the same pattern. This provides evidence that the female heights are nearly normal.

5. Using the same technique, determine whether or not female weights appear to come from a normal distribution.

hist(fdims$wgt)

qqnorm(fdims$wgt)
qqline(fdims$wgt)

fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
sim_norm <- rnorm(n = length(fdims$wgt), mean = fwgtmean, sd = fwgtsd)
qqnorm(sim_norm)
qqline(sim_norm)

The distribution of female weights appears heavily right skewed, with extreme values in the right tail. Too many points fall away from the normal line, so the distribution cannot be closely approximated by a normal distribution.
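
Skewness can also be put into a number. The following is a minimal sketch, assuming a simple moment-based estimate; `sample_skewness` is a hypothetical helper (base R has no built-in skewness function), and the two test vectors are simulated rather than taken from `bdims`.

```r
# Hypothetical helper: mean of cubed standardized values.
# Uses sd() (n - 1 denominator), so it is a slight variant of the pure moment estimator.
sample_skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(7)
symmetric    <- rnorm(1000)  # normal: population skewness is 0
right_skewed <- rexp(1000)   # exponential: population skewness is 2

print(c(symmetric = sample_skewness(symmetric),
        right_skewed = sample_skewness(right_skewed)))
```

Values near zero indicate symmetry and clearly positive values indicate a long right tail, so `sample_skewness(fdims$wgt)` would be expected to come out positive if the visual impression is correct.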

6. Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

Question 1 - What is the probability that a randomly selected female has a height between 125 cm and 175 cm?

pnorm(175, fhgtmean, fhgtsd) - pnorm(125, fhgtmean, fhgtsd)
## [1] 0.9391272
sum(fdims$hgt > 125 & fdims$hgt < 175) / length(fdims$hgt)
## [1] 0.9153846

The two results are reasonably close (about 0.94 versus 0.92). This is because the women’s heights can be approximated by a normal distribution, particularly in the center portion of the data.

Question 2 - What is the probability that a randomly selected female has a weight greater than 80 kg?

pnorm(80, fwgtmean, fwgtsd, lower.tail = FALSE)
## [1] 0.02182199
sum(fdims$wgt > 80) / length(fdims$wgt)
## [1] 0.04230769

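The results differ substantially: the empirical proportion (about 0.042) is nearly twice the theoretical probability (about 0.022). This is because the women’s weights are far from normally distributed.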
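
The two calculations above follow the same pattern, so they can be wrapped in one function. This is a sketch under stated assumptions: `compare_probs` is a hypothetical helper, it uses strict inequalities like the code above, and the demonstration vectors are simulated stand-ins rather than the `bdims` measurements.

```r
# Hypothetical helper: P(lower < X < upper) from a fitted normal
# and from the observed proportion in the sample.
compare_probs <- function(x, lower = -Inf, upper = Inf) {
  m <- mean(x)
  s <- sd(x)
  theoretical <- pnorm(upper, m, s) - pnorm(lower, m, s)
  empirical   <- mean(x > lower & x < upper)
  c(theoretical = theoretical, empirical = empirical)
}

# Simulated stand-ins (assumption): one normal sample, one right-skewed sample.
set.seed(11)
normal_like <- rnorm(260, mean = 165, sd = 6.5)
skewed_like <- rlnorm(260, meanlog = 4.1, sdlog = 0.2)

print(compare_probs(normal_like, lower = 125, upper = 175))
print(compare_probs(skewed_like, lower = 80))
```

With the real data, `compare_probs(fdims$hgt, 125, 175)` and `compare_probs(fdims$wgt, lower = 80)` should reproduce the four probabilities computed above.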

On Your Own

The stepwise pattern arises because the variable takes discrete rather than continuous values. For a continuous variable, the observations form a line or curve. For a discrete variable, the observations typically jump from one value to the next. Plot C corresponds to elbow diameter and Plot D corresponds to age. Age is definitely measured as a discrete variable. Elbow diameter is also somewhat discrete, judging by the limited range of values the variable takes.

length(fdims$age)/length(unique(fdims$age))
## [1] 7.027027
length(fdims$elb.di)/length(unique(fdims$elb.di))
## [1] 6.190476
length(fdims$che.de)/length(unique(fdims$che.de))
## [1] 3.880597
length(fdims$bii.di)/length(unique(fdims$bii.di))
## [1] 3.421053

Here, to compare how “discrete” each variable is, I divide the number of observations by the number of unique values of that variable. If a variable is more continuous, the sample will have more unique values, so the calculated ratio will be lower. If a variable is discrete, it will have comparatively fewer unique values, so the ratio will be higher. Comparing the ratios, age and elbow diameter are clearly more discrete than the other two variables.
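
The same ratio can be computed for several variables at once. The sketch below is illustrative only: `obs_per_unique` is a hypothetical helper name, and `toy` is a simulated data frame (not `fdims`) whose columns differ only in how coarsely they are rounded.

```r
# Hypothetical helper: observations per unique value (higher = more discrete).
obs_per_unique <- function(x) length(x) / length(unique(x))

# Simulated columns (assumption) with different measurement resolution.
set.seed(3)
toy <- data.frame(
  whole_years = round(runif(260, 18, 65)),    # recorded to the nearest integer
  one_decimal = round(rnorm(260, 13, 1), 1),  # recorded to one decimal place
  unrounded   = rnorm(260, 13, 1)             # effectively continuous
)

# Apply the ratio to every column and list the most discrete first.
ratios <- sapply(toy, obs_per_unique)
print(round(sort(ratios, decreasing = TRUE), 2))
```

With the real data, `sapply(fdims[c("age", "elb.di", "che.de", "bii.di")], obs_per_unique)` would give the four ratios shown above in a single call.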

qqnorm(fdims$kne.di)
qqline(fdims$kne.di)

The variable is right skewed. Most points fall on the low end with fewer points on the high end.

hist(fdims$kne.di)

As you can see, the distribution is right skewed.