Load libraries

library(dplyr)
library(knitr)

Load data

load("more/bdims.RData")

Setup variables

mdims <- subset(bdims, sex == 1) %>% dplyr::select(wgt, hgt)
# Keep every column for fdims: later exercises use bii.di, elb.di, age, che.de, and kne.di
fdims <- subset(bdims, sex == 0)

Lab 3: Normal Distribution

Exercise 1:

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

  • Male height: mean is 177.75 cm, standard deviation is 7.18 cm.
  • Female height: mean is 164.87 cm, standard deviation is 6.54 cm.
  • Both the male and female histograms resemble a normal, bell-shaped distribution. Both look symmetric and unimodal.
hist(mdims$hgt, xlab="Men's Height")

hist(fdims$hgt, xlab="Women's Height")

mhgtmean <- mean(mdims$hgt) #177.7453
mhgtsd <- sd(mdims$hgt)   #7.183629

fhgtmean <- mean(fdims$hgt) #164.8723
fhgtsd <- sd(fdims$hgt)   #6.544602
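
Drawing both histograms with the same bins and the same x-axis makes the comparison easier to see. This is a minimal sketch, not part of the original lab: `men_hgt` and `women_hgt` are simulated stand-ins drawn from normal distributions with the means and standard deviations reported above, and the group size of 250 is arbitrary. With the lab data loaded, `mdims$hgt` and `fdims$hgt` would take their place.

```r
# Simulated stand-ins for mdims$hgt and fdims$hgt (assumption: normal draws
# using the means and standard deviations reported above; n is arbitrary).
set.seed(1)
men_hgt   <- rnorm(250, mean = 177.7453, sd = 7.183629)
women_hgt <- rnorm(250, mean = 164.8723, sd = 6.544602)

# One set of 5 cm bins covering both groups, so the panels are directly comparable.
rng    <- range(c(men_hgt, women_hgt))
breaks <- seq(floor(rng[1] / 5) * 5, ceiling(rng[2] / 5) * 5, by = 5)

op <- par(mfrow = c(2, 1))  # stack the two histograms
hist(men_hgt,   breaks = breaks, xlim = range(breaks),
     xlab = "Height (cm)", main = "Men's Height")
hist(women_hgt, breaks = breaks, xlim = range(breaks),
     xlab = "Height (cm)", main = "Women's Height")
par(op)                     # restore the previous plotting layout
```

With a shared axis, the shift in center between the two groups and the similarity in spread are visible at a glance.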

Exercise 2:

Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes. Based on a comparison of the frequency histogram and the density histogram with the normal curve overlaid, it appears that the data follow a nearly normal distribution.

hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
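
The overlay works because `probability = TRUE` rescales the bars so that their total area is 1, the same as the area under a density curve. The sketch below checks this numerically and draws the curve with `curve()` instead of a hand-built grid. It is an illustration only: `hgt` is a simulated stand-in for `fdims$hgt`, and its sample size is an assumption.

```r
set.seed(2)
# Simulated stand-in for fdims$hgt (assumption: normal draws, n chosen for illustration).
hgt <- rnorm(260, mean = 164.8723, sd = 6.544602)

h <- hist(hgt, probability = TRUE, main = "Density histogram", xlab = "Height (cm)")

# Bar areas (density * bin width) sum to 1, matching the area under a density curve.
total_area <- sum(h$density * diff(h$breaks))

# curve() evaluates dnorm on a fine grid, so the overlay is smooth.
curve(dnorm(x, mean = mean(hgt), sd = sd(hgt)), add = TRUE, col = "blue")
```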

Exercise 3:

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

The points in the normal probability plot of the simulated data do not all fall on the line. The points towards the tails deviate more from the line than the points towards the center. As illustrated below, the normal probability plot of the actual data looks somewhat similar to the simulated one, although the actual plot shows more deviation towards the middle.

Normal probability plot of simulated data:

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
qqnorm(sim_norm)
qqline(sim_norm)

Normal probability plot of actual data:

qqnorm(fdims$hgt)
qqline(fdims$hgt)
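
For reference, a normal probability plot pairs the sorted data with the standard normal quantiles at evenly spaced probabilities. The sketch below builds one by hand and checks it against `qqnorm()`. It is an illustration under assumptions: `x` is a simulated stand-in for `fdims$hgt`, and the reference line uses the sample mean and standard deviation, whereas `qqline()` draws its line through the first and third quartiles.

```r
set.seed(3)
# Simulated stand-in for fdims$hgt (assumption: normal draws, n chosen for illustration).
x <- rnorm(260, mean = 164.8723, sd = 6.544602)

# Theoretical quantiles: standard normal quantiles at evenly spaced probabilities.
theo <- qnorm(ppoints(length(x)))
# Sample quantiles: the data sorted from smallest to largest.
samp <- sort(x)

plot(theo, samp, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q plot built by hand")
# For normal data the points track a line with intercept = mean and slope = sd.
abline(a = mean(x), b = sd(x), col = "blue")

# qqnorm() computes the same coordinates (returned in the original data order).
qq <- qqnorm(x, plot.it = FALSE)
```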

Exercise 4:

Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?

Below are several normal probability plots based on 9 simulations. Variability towards the tails seems to be a common feature across the simulations. In this particular run, the plot in row 2, column 3 shows variability towards the middle somewhat similar to the normal probability plot of the actual data. I’m not sure if this is “evidence” that the heights of females in general follow a normal distribution, but I can tell that this particular data set is close to a normal distribution.

set.seed(41)
qqnormsim(fdims$hgt)
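
The idea behind this kind of simulation grid can be sketched in a few lines. The helper below, `qq_sim_grid`, is hypothetical and is not the lab's `qqnormsim` implementation: it simply draws normal samples with the same size, mean, and standard deviation as the data and plots each one. `hgt` is a simulated stand-in for `fdims$hgt`.

```r
# Hypothetical helper (not the lab's qqnormsim): draws Q-Q plots for n_sim
# samples simulated from a normal with the same mean and sd as `dat`.
qq_sim_grid <- function(dat, n_sim = 9) {
  side <- ceiling(sqrt(n_sim))
  op <- par(mfrow = c(side, side), mar = c(2, 2, 2, 1))
  on.exit(par(op))  # restore the layout when the function returns
  sims <- replicate(n_sim, rnorm(length(dat), mean = mean(dat), sd = sd(dat)))
  for (i in seq_len(n_sim)) {
    qqnorm(sims[, i], main = paste("sim", i))
    qqline(sims[, i])
  }
  invisible(sims)  # one column per simulated sample
}

set.seed(41)
# Simulated stand-in for fdims$hgt (assumption: normal draws, n chosen for illustration).
hgt  <- rnorm(260, mean = 164.8723, sd = 6.544602)
sims <- qq_sim_grid(hgt)
```

Because every panel comes from a truly normal sample, the grid shows how much wobble to expect from sampling variation alone.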

Exercise 5:

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

Based on the analysis below, I would say that female weights appear to come from a roughly normal distribution, although the distribution is skewed to the right. The data set may contain some outliers: a few points lie above 90 kg, whereas none of the simulations show points beyond the 90 kg mark.

Frequency Histogram vs. Density Histogram:

The normal curve overlaid on the density histogram does not cover the highest bar of the histogram. I’m not sure what the significance of this is. Other than that, the blue line fits the histogram nicely.

fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)

hist(fdims$wgt, probability = TRUE, ylim = c(0, 0.06))
x <- 40:110
y <- dnorm(x = x, mean = fwgtmean, sd = fwgtsd)
lines(x = x, y = y, col = "blue")

Normal probability plot of actual weight:

qqnorm(fdims$wgt)
qqline(fdims$wgt)

Normal probability plot of simulated normally distributed data with same mean, sd:

I ran the Q-Q plot simulation a few times, setting a different seed each time. In this particular case, the simulation in row 1, column 2 looks somewhat similar to the normal probability plot of the actual data (female weight). None of the simulations I’ve seen so far generated points above the 100 kg mark; the cutoff has been around 90 kg.

#sim_norm <- rnorm(n = length(fdims$wgt), mean = fwgtmean, sd = fwgtsd)
#qqnorm(sim_norm)
#qqline(sim_norm)
set.seed(91)
qqnormsim(fdims$wgt)
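
One way to make the outlier argument concrete is to compare how many observations above 90 kg a normal model would predict with how many the sample contains. This is a rough sketch under stated assumptions: `wgt` is a made-up right-skewed stand-in (shifted gamma draws), not the lab's `fdims$wgt`, so the printed numbers say nothing about the real data.

```r
set.seed(91)
# Right-skewed stand-in for fdims$wgt (assumption: shifted gamma draws; these
# are not the lab's data).
wgt <- 40 + rgamma(260, shape = 4, scale = 5)

cutoff <- 90
# Upper-tail probability under a normal with the sample mean and sd.
p_tail   <- pnorm(cutoff, mean = mean(wgt), sd = sd(wgt), lower.tail = FALSE)
expected <- length(wgt) * p_tail  # count the normal model predicts above the cutoff
observed <- sum(wgt > cutoff)     # count actually present in the sample
c(expected = expected, observed = observed)
```

If the observed count is clearly larger than the expected count, the upper tail is heavier than a normal model allows, which is what right skew looks like.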

Exercise 6

(1) What is the probability of a female being less than 5 ft tall (152.4 cm)?

The theoretical probability that a female is shorter than 5 ft is about 2.8%, assuming a normal distribution. In the data set, the proportion is about 2.7%, so the two are very close.

pnorm(q=152.4, mean=fhgtmean, sd=fhgtsd) #0.028342
## [1] 0.028342
length(fdims$hgt[fdims$hgt < 152.4])/length(fdims$hgt) #0.02692308
## [1] 0.02692308
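
The theoretical value can also be reached by standardizing the cutoff into a Z score first. The sketch below uses only the mean and standard deviation reported above; the variable names are illustrative.

```r
hgt_mean <- 164.8723  # female mean height reported above (cm)
hgt_sd   <- 6.544602  # female standard deviation reported above (cm)

# Standardize the cutoff: how many standard deviations below the mean is 152.4 cm?
z <- (152.4 - hgt_mean) / hgt_sd

p_from_z   <- pnorm(z)                                    # standard normal lower tail
p_from_raw <- pnorm(152.4, mean = hgt_mean, sd = hgt_sd)  # same probability, unstandardized
c(z = z, p_from_z = p_from_z, p_from_raw = p_from_raw)
```

The cutoff sits about 1.9 standard deviations below the mean, which is why the probability is so small.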

(2) What is the probability of a female weighing between 150 lbs. (68.0 kg) and 160 lbs. (72.6 kg)?

The theoretical probability (assuming a normal distribution) that a female’s weight is between 150 lbs. and 160 lbs. is about 11.5%. However, in this particular data set, the proportion of females that fall within this range is only about 7.7%. This is a difference of about 3.8 percentage points. I’m not sure how significant this difference is.

x <- pnorm(q=68.0, mean=fwgtmean, sd=fwgtsd) #0.7792121
y <- pnorm(q=72.6, mean=fwgtmean, sd=fwgtsd) #0.8939697
theoretical <- y - x #0.1147576 (theoretical)
actual <- length(fdims$wgt[fdims$wgt < 72.6 & fdims$wgt > 68.0])/length(fdims$wgt) #0.07692308
theoretical - actual
## [1] 0.03783453
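
One rough way to gauge the size of this gap is to express it in standard errors of a sample proportion. The sketch below is a back-of-the-envelope check, not a formal test. It uses the two proportions reported above and assumes a sample size of 260, which is inferred from 0.07692308 = 20/260 rather than stated in the lab output.

```r
theoretical <- 0.1147576   # normal-model probability reported above
actual      <- 0.07692308  # observed proportion reported above
n           <- 260         # assumption: sample size inferred from 0.07692308 = 20/260

# Standard error of a sample proportion if the true probability were `theoretical`.
se <- sqrt(theoretical * (1 - theoretical) / n)
# Gap between model and data, measured in standard errors.
z  <- (theoretical - actual) / se
c(se = se, z = z)
```

Under these assumptions the gap comes out just under two standard errors, so it is larger than typical sampling noise but not dramatically so.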

On Your Own

PLEASE READ:

I cannot generate histograms or Q-Q plots when I knit; I get an error (see demo below). However, I am able to run the functions just fine directly in the R Studio environment. I tried to research this issue but was unable to find a fix. As a workaround, I saved the graphics and attached them to this file. Demo of error seen: https://screencast-o-matic.com/watch/cFef36D62X

1) Now let’s consider some of the other variables in the body dimensions dataset. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

(a) The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter ____.

Answer: Normal Q-Q Plot B

NOTE:

#I had to comment out the code that generates the histogram and Q-Q plots because these functions
#generate an error when I knit. However, I am able to run the functions fine directly in R Studio.
#hist(fdims$bii.di)
#qqnorm(fdims$bii.di)
#qqline(fdims$bii.di)
include_graphics("./fdims_bii.di_histogram.png")

include_graphics("./fdims_bii.di_QQplot.png")

(b) The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter ____.

Answer: Normal Q-Q Plot C

#Commented out because these functions generate an error when I knit. However, I am able to run the functions fine directly in R Studio.
#hist(fdims$elb.di)
#qqnorm(fdims$elb.di)
#qqline(fdims$elb.di)
include_graphics("./fdims_elb.di_histogram.png")

include_graphics("./fdims_elb.di_QQplot.png")

(c) The histogram for general age (age) belongs to normal probability plot letter ____.

Answer: Normal Q-Q Plot D

#Commented out because these functions generate an error when I knit. However, I am able to run the functions fine directly in R Studio.
#hist(fdims$age)
#qqnorm(fdims$age)
#qqline(fdims$age)
include_graphics("./fdims_age_histogram.png")

include_graphics("./fdims_age_QQplot.png")

(d) The histogram for female chest depth (che.de) belongs to normal probability plot letter ____.

Answer: Normal Q-Q Plot A

#Commented out because these functions generate an error when I knit. However, I am able to run the functions fine directly in R Studio.
#hist(fdims$che.de)
#qqnorm(fdims$che.de)
#qqline(fdims$che.de)
include_graphics("./fdims_che.de_histogram.png")

include_graphics("./fdims_che.de_QQplot.png")

2) Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

I searched for reasons why a Q-Q plot would have a stepwise pattern and found the page below, which explains that a stepwise pattern occurs when a variable is discrete. Plot D is the Q-Q plot for age, and the stepwise pattern is prominent in this plot.

https://stats.stackexchange.com/questions/113387/can-i-still-interpret-a-q-q-plot-that-uses-discrete-rounded-data
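
The effect is easy to reproduce with made-up data. In the sketch below, the same simulated values are plotted twice, once as continuous numbers and once rounded to whole units (as age in years would be). The values are invented for illustration and are not the lab's data.

```r
set.seed(5)
x_cont  <- rnorm(260, mean = 30, sd = 9)  # made-up continuous values
x_round <- round(x_cont)                  # recorded to the nearest whole unit, like age in years

op <- par(mfrow = c(1, 2))
qqnorm(x_cont,  main = "Continuous"); qqline(x_cont)
qqnorm(x_round, main = "Rounded");    qqline(x_round)
par(op)

# Rounding collapses many observations onto the same value; each repeated
# value shows up as a horizontal run of points (a "step").
c(continuous = length(unique(x_cont)), rounded = length(unique(x_round)))
```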

3) As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

The variable kne.di is right skewed. This is confirmed by the histogram.

#Commented out because these functions generate an error when I knit. However, I am able to run the functions fine directly in R Studio.
#hist(fdims$kne.di)
#qqnorm(fdims$kne.di)
#qqline(fdims$kne.di)
include_graphics("./fdims_kne.di_histogram.png")

include_graphics("./fdims_kne.di_QQplot.png")
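
For comparison, the sketch below shows what right skew looks like in both plots, using made-up log-normal data rather than the lab's `kne.di` values. It also prints a simple numeric check of the tail lengths.

```r
set.seed(6)
x <- rlnorm(260, meanlog = 0, sdlog = 0.5)  # made-up right-skewed data (log-normal)

op <- par(mfrow = c(1, 2))
hist(x, main = "Histogram")
qqnorm(x); qqline(x)  # right skew: points bend upward, away from the line, at the upper end
par(op)

# Numeric check: in right-skewed data the upper tail (median to 95th percentile)
# is longer than the lower tail (5th percentile to median).
q <- quantile(x, c(0.05, 0.5, 0.95))
upper_tail <- unname(q[3] - q[2])
lower_tail <- unname(q[2] - q[1])
c(mean = mean(x), median = unname(q[2]),
  upper_tail = upper_tail, lower_tail = lower_tail)
```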