The Normal Distribution

load("more/bdims.RData")
head(bdims)
##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1

Exercise 1

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

mens <- subset(bdims, sex == 1)
females <- subset(bdims, sex == 0)

hist(mens$hgt, xlab = "Height of Men in cm", main = "Histogram For Men's Height")

hist(females$hgt, xlab = "Height of Females in cm", main = "Histogram For Females' Height")

The mean of the men's heights looks greater than the mean of the female heights. The men's height distribution looks unimodal and symmetric. The distribution of female heights is also unimodal, and it seems to be skewed left.
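The visual comparison can be backed up with a few numbers. The following is a minimal sketch, assuming a data frame shaped like bdims with hgt and sex columns. The demo data frame is a made-up stand-in (not the real measurements) so that the block runs on its own, and height_stats is an illustrative name.

```r
# Made-up stand-in data (NOT the bdims measurements): evenly spaced normal
# quantiles, so the block runs without the lab files.
demo <- data.frame(
  hgt = c(qnorm(ppoints(50), mean = 178, sd = 7),     # hypothetical "men"
          qnorm(ppoints(50), mean = 165, sd = 6.5)),  # hypothetical "women"
  sex = rep(c(1, 0), each = 50)
)

# Center, spread, and group size for each value of sex.
height_stats <- sapply(split(demo$hgt, demo$sex), function(h) {
  c(mean = mean(h), median = median(h), sd = sd(h), n = length(h))
})
print(round(height_stats, 1))
```

With the lab data loaded, the same sapply() call could be run on split(bdims$hgt, bdims$sex) instead.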

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

fhgtmean <- mean(females$hgt)
fhgtsd   <- sd(females$hgt)
hist(females$hgt, probability = TRUE)
x <- 150:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")

It does look like a normal distribution, though it is hard to tell just by looking at the histogram.

We can draw a Q-Q plot to take a closer look. If the data points follow the line closely, the data are approximately normal.

qqnorm(females$hgt)
qqline(females$hgt)

As we can see, the data points are very close to the line, but there are some outliers towards the top and the bottom, so again the distribution is very close to normal.
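To make clear what the Q-Q plot is showing, here is a small sketch of how its points can be built by hand: the sorted sample values are paired with standard normal quantiles at evenly spread plotting positions. The sample x is made-up data standing in for a height vector, and the names theoretical and observed are illustrative.

```r
# Made-up sample standing in for a height vector (not the lab data).
set.seed(1)
x <- rnorm(100, mean = 165, sd = 6.5)

n <- length(x)
theoretical <- qnorm(ppoints(n))  # standard normal quantiles at n plotting positions
observed    <- sort(x)            # ordered sample values

# These pairs are the points that a normal Q-Q plot displays.
plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# Cross-check against qqnorm() without drawing a second plot.
qq <- qqnorm(x, plot.it = FALSE)
print(all.equal(sort(qq$x), theoretical))
```

If the sample is close to normal, these pairs fall near a straight line, which is what qqline() helps judge.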

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

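# Simulate normal data with the same size, mean, and sd as the female heights
sim_norm <- rnorm(n = length(females$hgt), mean = fhgtmean, sd = fhgtsd)

# Normal probability plot of the simulated data
qqnorm(sim_norm)
qqline(sim_norm)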

qqnormsim(females$hgt)

Not all of the points fall on the line; there are some outliers at the far ends. The plot for the simulated data does look very similar to the plot for the actual female heights.
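qqnormsim() is not a base R function; it appears to come with the lab materials. The following is a rough sketch of what such a helper might do, assuming it draws the Q-Q plot of the data next to several Q-Q plots of normal samples simulated with the data's mean and standard deviation. The name qq_sim_grid, the 3-by-3 layout, and the made-up demo_hgt vector are illustrative and may differ from the lab's actual implementation.

```r
# Hypothetical helper: the data's Q-Q plot plus n_sim plots of simulated normal samples.
qq_sim_grid <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3))  # 3-by-3 grid of panels
  on.exit(par(old_par))            # restore the previous layout afterwards
  qqnorm(dat, main = "Normal Q-Q Plot (Data)")
  qqline(dat)
  for (i in seq_len(n_sim)) {
    # Simulated sample with the same size, mean, and sd as the data
    sim <- rnorm(n = length(dat), mean = mean(dat), sd = sd(dat))
    qqnorm(sim, main = "Normal Q-Q Plot (Sim)")
    qqline(sim)
  }
  invisible(NULL)
}

set.seed(42)
demo_hgt <- rnorm(260, mean = 165, sd = 6.5)  # made-up stand-in, not the lab data
qq_sim_grid(demo_hgt)
```

Seeing the real data beside several truly normal samples gives a sense of how much wiggle at the ends is ordinary sampling variation.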

Exercise 4

Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?

Yes, the normal probability plot for female heights looks very similar to the plots for the simulated data. The points do not fall perfectly on the line, but the outliers are similar to the outliers in the simulations. There is sufficient evidence to say that the female heights are close to normal.

Exercise 5

Using the same technique, determine whether or not female weights appear to come from a normal distribution.

female_weight_mean <- mean(females$wgt)
female_weight_sd   <- sd(females$wgt)
hist(females$wgt, probability = TRUE, main = "Histogram - Female Weights Density", xlab = "Female Weight in Kg")
x <- 40:120
y <- dnorm(x = x, mean = female_weight_mean, sd = female_weight_sd)
lines(x = x, y = y, col = "blue")

The female weights appear to have a unimodal distribution with a longer tail to the right. Compared with the normal curve, the distribution is not symmetric, so the data look right skewed.

Normal Probability Plot for Female Weights

qqnorm(females$wgt)
qqline(females$wgt)

In the Q-Q plot there is a large deviation between the line and the data. There are a lot of outliers towards the top, and they are very far from the line. However, there are not many outliers at the lower end.

Simulations of the QQ Plot

qqnormsim(females$wgt)

The Q-Q plots of the simulations clearly do not match the plot of the actual female weights: the simulated points stay very close to the line and have few outliers, while the real data deviate strongly at the upper end. Female weights therefore do not appear to come from a normal distribution.

Exercise 6

Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

Question 1 for Probability.

Let's find the probability that a woman's height is greater than 5 feet 10 inches.

5 feet 10 inches = 70 inches = 177.8 cm
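The conversion can be checked in R. This is a small sketch; the variable names are illustrative, and it relies on 1 inch being exactly 2.54 cm.

```r
feet   <- 5
inches <- 10

total_inches <- feet * 12 + inches   # 12 inches per foot
cutoff_cm    <- total_inches * 2.54  # 1 inch = 2.54 cm

print(c(inches = total_inches, cm = cutoff_cm))
```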

Using the Theoretical Normal Distribution Method

pnorm(q = 177.8, mean = fhgtmean, sd = fhgtsd, lower.tail = FALSE)
## [1] 0.02411585

Empirical Distribution Method

sum(females$hgt > 177.8)/length(females$hgt)
## [1] 0.01923077

Question 2 for Probability.

Find the probability that a woman weighs more than 85 kg.

Using the Theoretical Probability

pnorm(q = 85, mean = female_weight_mean, sd = female_weight_sd, lower.tail = FALSE)
## [1] 0.005582733

Using the Empirical Method

sum(females$wgt > 85)/length(females$wgt)
## [1] 0.01538462

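The height variable had closer agreement between the two methods: the theoretical and empirical probabilities differ by about 0.005 for height, compared with about 0.010 for weight.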
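The comparison can be made explicit by computing the gap between the two methods from the four probabilities reported above. The helper compare_upper_tail is a hypothetical name, sketched here only as one way to repeat the calculation for any numeric vector and cutoff.

```r
# Probabilities as reported in the output above.
height <- c(theoretical = 0.02411585,  empirical = 0.01923077)
weight <- c(theoretical = 0.005582733, empirical = 0.01538462)

# Absolute gap between the two methods for each variable.
gap <- c(height = unname(abs(diff(height))),
         weight = unname(abs(diff(weight))))
print(round(gap, 4))

# Hypothetical helper: both upper-tail probabilities for a vector and a cutoff.
compare_upper_tail <- function(x, cutoff) {
  c(theoretical = pnorm(cutoff, mean = mean(x), sd = sd(x), lower.tail = FALSE),
    empirical   = mean(x > cutoff))  # share of observations above the cutoff
}
```

With the lab data loaded, compare_upper_tail(females$hgt, 177.8) should reproduce the two height probabilities, since mean(x > cutoff) is the same as sum(x > cutoff) / length(x).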

On Your Own

Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
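The standardization described in the question (subtract the mean, then divide by the standard deviation) can be sketched as follows. The vector x holds made-up values, not lab data.

```r
x <- c(52.1, 60.4, 58.0, 71.3, 65.2)  # made-up values for illustration

# Standardize: subtract the mean, then divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

# After standardizing, the mean is 0 and the sd is 1 (up to rounding error).
print(round(c(mean = mean(z), sd = sd(z)), 10))

# Base R's scale() applies the same transformation.
print(isTRUE(all.equal(z, as.vector(scale(x)))))
```

Standardizing changes the units but not the shape, so a histogram and its normal probability plot can still be matched by skewness and tail behavior.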

a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.

qqnorm(females$bii.di)
qqline(females$bii.di)

b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.

qqnorm(females$elb.di)
qqline(females$elb.di)

c. The histogram for general age (age) belongs to normal probability plot letter D.

qqnorm(bdims$age)
qqline(bdims$age)

d. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.

qqnorm(females$che.de)
qqline(females$che.de)

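Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

I think this is the case because stepwise patterns are normally created by discrete variables such as age, which is recorded in whole years. The same happens when a measurement is recorded to limited precision: many observations share the same value, and the ties show up as steps. So I believe the reason is that the variables are discrete.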
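The effect of discreteness can be illustrated with a small sketch. The values below are made up: a continuous variable is rounded to whole numbers, the way age is recorded in whole years, and the names raw and rounded are illustrative.

```r
# Made-up continuous values (evenly spaced normal quantiles), then rounded
# to whole numbers.
raw     <- qnorm(ppoints(200), mean = 30, sd = 9)
rounded <- round(raw)

# Rounding collapses many distinct values into a much smaller set, creating ties.
print(c(raw = length(unique(raw)), rounded = length(unique(rounded))))

# Tied values plot as horizontal runs of points, which gives the staircase look.
qqnorm(rounded)
qqline(rounded)
```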

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

Normal Probability Plot

qqnorm(females$kne.di)
qqline(females$kne.di)

The points curve upward, away from the line, at the upper end of the plot, so the variable looks to be right skewed, and there are outliers at the top end.

Histogram

hist(females$kne.di, xlab = "Female Knee Diameter in cm", main = "Histogram for Female Knee Diameter")

We can clearly see that the distribution is right skewed.
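A numeric check can complement the two plots. This is a minimal sketch on made-up right-skewed values (lognormal quantiles), not the knee measurements. It uses two common indicators: the mean sitting above the median, and a positive moment-based sample skewness.

```r
# Made-up right-skewed values (lognormal quantiles), not the knee measurements.
x <- qlnorm(ppoints(200), meanlog = log(18), sdlog = 0.15)

# Right skew pulls the mean above the median.
print(c(mean = mean(x), median = median(x)))

# Moment-based sample skewness; a positive value points to right skew.
skewness <- mean((x - mean(x))^3) / (mean((x - mean(x))^2))^1.5
print(skewness)
```

With the lab data loaded, the same two calculations could be applied to females$kne.di.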