Probability

The Normal Distribution

The Data

load("more/bdims.RData")
head(bdims)
##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
  1. Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

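Both distributions are unimodal and roughly symmetric, with similar spreads (standard deviations of about 7.2 cm for men and 6.5 cm for women, computed below). The main difference is location: men’s heights are centered about 13 cm higher (mean 177.7 cm versus 164.9 cm). Note that hist() chose different bins for the two groups (men’s start at 155 cm with nine bins, women’s at 145 cm with eight), so the side-by-side bars below are not aligned on a common height axis, and rbind() recycles the first female density to fill the ninth column, hence the warning.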

m<-hist(mdims$hgt,plot=FALSE)
m
## $breaks
##  [1] 155 160 165 170 175 180 185 190 195 200
## 
## $counts
## [1]  2  5 28 44 76 50 28 12  2
## 
## $density
## [1] 0.001619433 0.004048583 0.022672065 0.035627530 0.061538462 0.040485830
## [7] 0.022672065 0.009716599 0.001619433
## 
## $mids
## [1] 157.5 162.5 167.5 172.5 177.5 182.5 187.5 192.5 197.5
## 
## $xname
## [1] "mdims$hgt"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
f<-hist(fdims$hgt,plot=FALSE)
f
## $breaks
## [1] 145 150 155 160 165 170 175 180 185
## 
## $counts
## [1]  3 15 52 63 69 38 18  2
## 
## $density
## [1] 0.002307692 0.011538462 0.040000000 0.048461538 0.053076923 0.029230769
## [7] 0.013846154 0.001538462
## 
## $mids
## [1] 147.5 152.5 157.5 162.5 167.5 172.5 177.5 182.5
## 
## $xname
## [1] "fdims$hgt"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
h<-rbind(m$density,f$density)
## Warning in rbind(m$density, f$density): number of columns of result is not
## a multiple of vector length (arg 2)
h
##             [,1]        [,2]       [,3]       [,4]       [,5]       [,6]
## [1,] 0.001619433 0.004048583 0.02267206 0.03562753 0.06153846 0.04048583
## [2,] 0.002307692 0.011538462 0.04000000 0.04846154 0.05307692 0.02923077
##            [,7]        [,8]        [,9]
## [1,] 0.02267206 0.009716599 0.001619433
## [2,] 0.01384615 0.001538462 0.002307692
barplot(h,beside = T)
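
Because the two hist() calls above use different breaks (155 to 200 for men, 145 to 185 for women), the columns of h do not refer to the same height ranges. A minimal sketch of one way to align them is shown here. It rebuilds the densities from the bin counts printed above so that it runs without bdims.RData; the names m_counts, f_counts, breaks and h_aligned are introduced only for this illustration.

```r
# Bin counts copied from the hist() output above (all bins are 5 cm wide).
m_counts <- c(2, 5, 28, 44, 76, 50, 28, 12, 2)  # men, bins from 155 to 200
f_counts <- c(3, 15, 52, 63, 69, 38, 18, 2)     # women, bins from 145 to 185

# One common grid of bin edges that covers both groups.
breaks <- seq(145, 200, by = 5)
mids   <- head(breaks, -1) + 2.5

# Pad each group with empty bins so both vectors describe the same 11 bins.
m_full <- c(0, 0, m_counts)
f_full <- c(f_counts, 0, 0, 0)

# Convert counts to densities (proportion per cm of height).
h_aligned <- rbind(men   = m_full / (sum(m_full) * 5),
                   women = f_full / (sum(f_full) * 5))
colnames(h_aligned) <- mids

barplot(h_aligned, beside = TRUE, legend.text = TRUE,
        xlab = "Height (cm)", ylab = "Density")
```

With the raw data loaded, the same alignment should follow from passing the shared edges to both calls, for example hist(mdims$hgt, breaks = breaks, plot = FALSE) and likewise for fdims$hgt.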

mean(mdims$hgt)
## [1] 177.7453
mean(fdims$hgt)
## [1] 164.8723
sd(mdims$hgt)
## [1] 7.183629
sd(fdims$hgt)
## [1] 6.544602

The normal distribution

fhgtmean <- mean(fdims$hgt)
fhgtsd   <- sd(fdims$hgt)

hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
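
The blue curve is the normal density evaluated at each value of x. As a rough sketch of what dnorm() computes, the block below writes the density formula out by hand and compares it with dnorm(). The mean and standard deviation are the rounded values reported earlier, hard-coded here so the example runs without bdims.RData; y_manual and y_dnorm are names made up for this check.

```r
# Rounded summary statistics for female heights, copied from the output above.
fhgtmean <- 164.8723
fhgtsd   <- 6.544602

x <- 140:190

# Normal density written out:
# exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
y_manual <- exp(-(x - fhgtmean)^2 / (2 * fhgtsd^2)) / (fhgtsd * sqrt(2 * pi))
y_dnorm  <- dnorm(x, mean = fhgtmean, sd = fhgtsd)

# The two vectors should agree up to floating-point tolerance.
all.equal(y_manual, y_dnorm)
```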

  1. Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes. The histogram is unimodal and roughly symmetric, and it follows the normal density curve closely.

Evaluating the normal distribution

qqnorm(fdims$hgt)
qqline(fdims$hgt)

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
  1. Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

No, not all points fall on the line. Similar to the real data, points at the extremes do not follow the line, while points in the middle do.

qqnorm(sim_norm)
qqline(sim_norm)

  1. Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?

We run several simulations and look at the corresponding Q-Q plots. In all of them the points fall close to the line, with small departures at the extremes. The plot for the real heights looks much like these simulated plots, which is evidence that female heights are nearly normal.

qqnormsim(fdims$hgt)
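
qqnormsim() is supplied with the lab materials, and its source is not shown here. As an illustration of the idea only, the following sketch draws a Q-Q plot of the data next to Q-Q plots of samples simulated from a normal distribution with the same mean and standard deviation. The function name qq_sim_grid, the choice of eight simulations, and the placeholder data are all assumptions, not the lab's implementation.

```r
# Hypothetical stand-in for qqnormsim(); not the lab's actual code.
qq_sim_grid <- function(x, n_sim = 8) {
  # 3 x 3 grid: one panel for the data, eight for simulations.
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))

  qqnorm(x, main = "Data")
  qqline(x)

  for (i in seq_len(n_sim)) {
    # Simulated sample with the same size, mean and sd as the data.
    sim <- rnorm(length(x), mean = mean(x), sd = sd(x))
    qqnorm(sim, main = paste("Simulation", i))
    qqline(sim)
  }
  invisible(n_sim + 1)  # number of panels drawn
}

set.seed(1)
# Placeholder for fdims$hgt so the example runs without bdims.RData.
heights  <- rnorm(260, mean = 164.87, sd = 6.54)
n_panels <- qq_sim_grid(heights)
```

If the data panel is hard to tell apart from the simulated panels, the normal model is a reasonable description of the data.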

  1. Using the same technique, determine whether or not female weights appear to come from a normal distribution.

We can first look at a Q-Q plot of the weights. As with the heights, most points lie close to the line, with departures toward the tails.

qqnorm(fdims$wgt)
qqline(fdims$wgt)

We can also run simulations using the mean and standard deviation of the weights to see how the data compare against a normal distribution. Again, most points follow the straight line, so we conclude it is also reasonable to assume a normal distribution.

qqnormsim(fdims$wgt)

Normal probabilities

What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?

#Using the theoretical normal distribution
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
#empirical solution
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
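
The same theoretical probability can be obtained by standardizing first. The sketch below converts 182 cm to a Z score and looks up the upper tail of the standard normal distribution. It hard-codes the rounded mean and standard deviation reported earlier, so it should agree with the value above only up to rounding; z and p_upper are names introduced for this example.

```r
# Rounded summary statistics for female heights, copied from the output above.
fhgtmean <- 164.8723
fhgtsd   <- 6.544602

# Z score: how many standard deviations 182 cm lies above the mean.
z <- (182 - fhgtmean) / fhgtsd

# Upper-tail probability of the standard normal at z.
p_upper <- pnorm(z, lower.tail = FALSE)

z
p_upper
```

Using lower.tail = FALSE gives the same quantity as 1 - pnorm(z) without the explicit subtraction.
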
  1. Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

What is the probability that a randomly chosen young adult female is shorter than 5 feet (152.4 cm)?

#Theoretical Normal Distribution
pnorm(q = 152.4, mean = fhgtmean, sd = fhgtsd)
## [1] 0.028342
#Empirical
length(fdims$hgt[fdims$hgt<152.4])/length(fdims$hgt)
## [1] 0.02692308

What is the probability that a randomly chosen young adult female weighs more than 150 lb (68.0389 kg)?

#Theoretical Normal Distribution
fwgtmean <- mean(fdims$wgt)
fwgtsd   <- sd(fdims$wgt)
1 - pnorm(q = 68.0389, mean = fwgtmean, sd = fwgtsd)
## [1] 0.2195895
#Empirical
length(fdims$wgt[fdims$wgt>68.0389])/length(fdims$wgt)
## [1] 0.1923077

Height shows the closer agreement between the two methods: the theoretical and empirical probabilities differ by about 0.001 (0.0283 versus 0.0269), whereas for weight they differ by about 0.027 (0.2196 versus 0.1923).

On Your Own

  • Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

    a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.

qqnorm(fdims$bii.di)
qqline(fdims$bii.di)

    b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.

qqnorm(fdims$elb.di)
qqline(fdims$elb.di)

    c. The histogram for general age (age) belongs to normal probability plot letter D.

qqnorm(fdims$age)
qqline(fdims$age)

    d. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.

qqnorm(fdims$che.de)
qqline(fdims$che.de)

  • Note that normal probability plots C and D have a slight stepwise pattern.
    Why do you think this is the case?

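Plots C and D correspond to elbow diameter (elb.di) and age, as matched above. The stepwise pattern comes from how these variables were recorded rather than from the shape of their distributions: age is given in whole years and elbow diameter to the nearest 0.1 cm, so many observations share exactly the same value and appear as horizontal runs of points in the probability plot. Skewness would instead show up as curvature away from the straight line.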
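
One way to see where a stepwise pattern can come from is to round a simulated normal sample and compare its Q-Q plot with that of the unrounded values. This is a sketch on made-up data, not the lab's variables; the sample size, mean and standard deviation are arbitrary assumptions.

```r
set.seed(42)

# Hypothetical continuous variable; mean and sd are arbitrary choices.
x <- rnorm(260, mean = 30, sd = 9)

# Record each value to the nearest whole unit, as with age in years.
x_rounded <- round(x)

# Rounding collapses many values onto the same number (ties).
length(unique(x))
length(unique(x_rounded))

# Tied values plot as horizontal runs, which produce the steps.
op <- par(mfrow = c(1, 2))
qqnorm(x, main = "Full precision")
qqline(x)
qqnorm(x_rounded, main = "Rounded")
qqline(x_rounded)
par(op)
```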

  • As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

The probability plot shows the data are right skewed, with the points at the right end falling away from the line.

qqnorm(fdims$kne.di)
qqline(fdims$kne.di)

The histogram confirms this: its right tail is longer than its left.

hist(fdims$kne.di)
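
A numeric summary can back up the visual impression of skewness. The sketch below computes a moment-based sample skewness, where positive values indicate a longer right tail. The helper sample_skewness and the toy data are assumptions made for illustration and are not part of the lab.

```r
# Moment-based sample skewness (hypothetical helper, not part of the lab).
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Small made-up sample with a long right tail.
toy <- c(1, 1, 2, 2, 2, 3, 3, 4, 6, 10)

sample_skewness(toy)   # positive: right skewed
sample_skewness(-toy)  # negative: the mirror image is left skewed
```

With the lab data loaded, sample_skewness(fdims$kne.di) would give a comparable check for knee diameter.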

[Figure: histQQmatch (histograms and normal probability plots for the matching exercise)]