Exercise 1: Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
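# Load the body dimensions data (bdims)
load("more/bdims.RData")
# Split by sex: 1 = male, 0 = female
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
# Histograms of height for each group
hist(fdims$hgt, main = "Distribution of Female Heights")
hist(mdims$hgt, main = "Distribution of Male Heights")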
The shapes of the two distributions are quite similar: both male and female heights show a symmetric, unimodal distribution (the multiple modes here are best thought of as a result of sampling variability and a small bin width). The spread of the two distributions is also similar, with most of the observations falling within an interval spanning ~25 cm. They differ most notably in their centers, with a mean/median/mode of ~178 cm for men and ~165 cm for women.
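The visual comparison of center and spread can be backed up with summary statistics. A minimal sketch, assuming simulated heights stand in for `mdims$hgt` and `fdims$hgt`: the means of 178 and 165 echo the approximate centers quoted above, while the sd of 6.5, the sample sizes, and the helper `describe_hgt` are illustrative assumptions, not part of the lab.

```r
# Simulated stand-ins for mdims$hgt and fdims$hgt (assumed values, not the real data)
set.seed(1)
m_hgt <- rnorm(250, mean = 178, sd = 6.5)
f_hgt <- rnorm(260, mean = 165, sd = 6.5)

# Hypothetical helper: center and spread of one sample
describe_hgt <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

# One column per group, one row per statistic
hgt_summary <- round(cbind(men = describe_hgt(m_hgt), women = describe_hgt(f_hgt)), 1)
print(hgt_summary)
```

With the real data, replacing `m_hgt` and `f_hgt` by `mdims$hgt` and `fdims$hgt` should give the corresponding table for bdims.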
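# Mean and standard deviation of female heights
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
# Density histogram with a normal curve (same mean and sd) overlaid
hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06))
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x, y, type = "l")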
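The normal curve can be drawn on top of the histogram because `probability = TRUE` rescales the bars so that their total area is 1, the same as the area under a density curve. A small check of this, assuming simulated heights in place of `fdims$hgt` (the mean of 165 and sd of 6.5 are illustrative assumptions):

```r
# Simulated stand-in for fdims$hgt (assumed mean and sd, not the real data)
set.seed(2)
hgt <- rnorm(260, mean = 165, sd = 6.5)

# Compute the histogram without drawing it
h <- hist(hgt, plot = FALSE)

# Bar area = density * bin width; the areas add up to 1
bar_area <- sum(h$density * diff(h$breaks))
print(bar_area)
```

Without `probability = TRUE` the bars show counts, and the `dnorm` curve would sit almost flat along the bottom of the plot.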
Exercise 2: Based on this plot, does it appear that the data follow a nearly normal distribution?
Answer: Yes, the histogram suggests that the data approximately follow a normal distribution, but the histogram alone is not conclusive.
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
Exercise 3: Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
qqnorm(sim_norm)
qqline(sim_norm)
Most of the points fall on the line, but a few points at the lower and upper ends do not.
To get a clearer picture of the nature of the distribution, we can compare it to many more plots using the following simulation function. It may be helpful to click the zoom button in the plot window.
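It can help to see what a normal probability plot actually plots: the sorted sample values against the quantiles a standard normal sample of the same size would be expected to have. A minimal sketch that rebuilds those coordinates by hand, assuming a simulated sample in place of `sim_norm` (mean and sd are illustrative assumptions):

```r
# Simulated sample (assumed mean and sd, standing in for sim_norm)
set.seed(3)
sim <- rnorm(260, mean = 165, sd = 6.5)
n <- length(sim)

# Manual coordinates: theoretical quantiles (x) and ordered sample values (y)
theo <- qnorm(ppoints(n))
samp <- sort(sim)

# Coordinates computed by qqnorm(), without drawing the plot
qq <- qqnorm(sim, plot.it = FALSE)
same_x <- all.equal(sort(qq$x), theo)
same_y <- all.equal(sort(qq$y), samp)

# A correlation close to 1 means the points lie close to a straight line
r <- cor(theo, samp)
print(c(same_x, same_y))
print(r)
```

Dropping `plot.it = FALSE` draws the usual plot. The correlation is only a rough numeric companion to the picture, not a formal test of normality.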
qqnormsim(fdims$hgt)
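The idea behind this comparison can be sketched in a few lines: draw several normal samples with the same size, mean, and sd as the data, and look at their normal probability plots next to the one for the data. The helper `qq_sim_grid` below is hypothetical and is not the lab's `qqnormsim`; the data are simulated stand-ins for `fdims$hgt`.

```r
# Hypothetical helper (not the lab's qqnormsim): Q-Q plot of the data plus
# n_sim Q-Q plots of normal samples with the same n, mean, and sd.
# The 3 x 3 layout assumes n_sim = 8.
qq_sim_grid <- function(x, n_sim = 8, draw = interactive()) {
  sims <- replicate(n_sim, rnorm(length(x), mean = mean(x), sd = sd(x)))
  if (draw) {
    old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
    on.exit(par(old_par))
    qqnorm(x, main = "Data")
    qqline(x)
    for (i in seq_len(n_sim)) {
      qqnorm(sims[, i], main = paste("Sim", i))
      qqline(sims[, i])
    }
  }
  invisible(sims)  # one column per simulated sample
}

set.seed(4)
obs <- rnorm(260, mean = 165, sd = 6.5)  # simulated stand-in for fdims$hgt
sims <- qq_sim_grid(obs)
print(dim(sims))
```

By default the plots are drawn only in an interactive session; pass `draw = TRUE` to force drawing, for example when knitting.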
Exercise 4: Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?
Yes, the normal probability plot of female heights is similar to the plots for the simulated data. This is evidence that female heights are nearly normal.
Exercise 5: Using the same technique, determine whether or not female weights appear to come from a normal distribution.
qqnorm(fdims$wgt)
qqline(fdims$wgt)
qqnormsim(fdims$wgt)
Female weights appear to approximately follow a normal distribution, but with a slight right skew.
Exercise 6: Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
2*pnorm(q = fhgtmean-fhgtsd, mean = fhgtmean, sd = fhgtsd)
## [1] 0.3173105
The area under the curve below mean - sd plus the area above mean + sd is 0.3173105. So, under the theoretical normal distribution, the probability that a randomly chosen female height falls above mean + sd OR below mean - sd is approximately 31.73%.
The same probability using the empirical distribution is as follows:
sum(fdims$hgt > fhgtmean+fhgtsd ) / length(fdims$hgt) + sum(fdims$hgt < fhgtmean-fhgtsd ) / length(fdims$hgt)
## [1] 0.3076923
Using the empirical distribution, the probability that a randomly chosen female height falls above mean + sd OR below mean - sd is 30.77%, which is reasonably close to 31.73%.
# Mean and standard deviation of female weights
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
2*pnorm(q = fwgtmean-fwgtsd, mean = fwgtmean, sd = fwgtsd)
## [1] 0.3173105
The area under the curve below mean - sd plus the area above mean + sd is 0.3173105. So, under the theoretical normal distribution, the probability that a randomly chosen female weight falls above mean + sd OR below mean - sd is approximately 31.73%.
The same probability using the empirical distribution is as follows:
sum(fdims$wgt > fwgtmean+fwgtsd ) / length(fdims$wgt) + sum(fdims$wgt < fwgtmean-fwgtsd ) / length(fdims$wgt)
## [1] 0.2923077
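Using the empirical distribution, the probability that a randomly chosen female weight falls above mean + sd OR below mean - sd is 29.23%, which is close to 31.73%.
Female height has the closer agreement between the two methods: its empirical and theoretical probabilities differ by about 1 percentage point (30.77% vs. 31.73%), compared with about 2.5 percentage points for weight (29.23% vs. 31.73%).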
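Because both questions are phrased in units of standard deviations, the theoretical answer is the same for any variable (2 * pnorm(-1)), and only the empirical share changes. A minimal sketch of the comparison, assuming a hypothetical helper `tail_prob` and simulated data: a symmetric sample as a height-like stand-in and a right-skewed sample as a weight-like stand-in, with illustrative parameter values.

```r
# Hypothetical helper: P(more than k sd away from the mean),
# theoretical normal value vs. empirical share of the sample
tail_prob <- function(x, k = 1) {
  z <- (x - mean(x)) / sd(x)
  c(theoretical = 2 * pnorm(-k), empirical = mean(abs(z) > k))
}

set.seed(5)
sym  <- rnorm(260, mean = 165, sd = 6.5)          # symmetric stand-in (height-like)
skew <- rlnorm(260, meanlog = 4.1, sdlog = 0.15)  # right-skewed stand-in (weight-like)

p_sym  <- tail_prob(sym)
p_skew <- tail_prob(skew)
print(round(rbind(symmetric = p_sym, skewed = p_skew), 4))
```

With the real data, `tail_prob(fdims$hgt)` and `tail_prob(fdims$wgt)` should reproduce the four probabilities computed above.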
1. Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.
The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.
The histogram for general age (age) belongs to normal probability plot letter D.
The histogram for female chest depth (che.de) belongs to normal probability plot letter A.
Answers with plots
qqnorm(fdims$bii.di)
qqline(fdims$bii.di)
The answer for (a) is B
qqnorm(fdims$elb.di)
qqline(fdims$elb.di)
The answer for (b) is C
qqnorm(fdims$age)
qqline(fdims$age)
The answer for (c) is D
qqnorm(fdims$che.de)
qqline(fdims$che.de)
The answer for (d) is A
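2. Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?
Answer
Both variables are recorded on a coarse, discrete scale: age is given in whole years, and the elbow diameters appear to have been rounded to one decimal place, with many values ending in .0. Many observations therefore share exactly the same value, and each group of repeated values plots at the same height, which produces the steps.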
fdims$age
## [1] 22 20 19 25 21 23 26 22 28 40 32 25 25 29 22 25 23 37 19 23 25 26 24
## [24] 29 22 30 23 38 23 19 46 20 22 25 21 23 31 29 19 21 23 24 20 19 20 19
## [47] 20 19 22 39 18 19 26 20 20 26 21 21 38 23 37 19 25 20 41 26 21 47 19
## [70] 44 35 32 46 22 49 52 25 48 41 18 30 20 24 23 30 23 45 20 20 23 21 28
## [93] 45 24 25 19 20 29 24 24 25 31 22 20 32 25 19 23 22 20 27 34 25 26 19
## [116] 26 25 20 21 18 19 27 26 36 20 28 32 32 23 20 20 20 23 20 28 23 19 28
## [139] 19 29 32 20 28 36 22 20 22 32 40 40 42 40 44 30 28 37 40 45 35 41 27
## [162] 20 24 36 27 32 64 21 32 35 41 40 29 40 24 23 41 44 53 19 24 25 20 34
## [185] 32 24 29 31 34 36 32 39 37 52 24 33 42 34 37 39 41 36 19 22 23 36 45
## [208] 25 67 26 21 33 25 24 21 35 27 27 26 25 44 29 26 23 32 32 43 32 41 33
## [231] 28 28 25 38 37 25 37 27 27 20 19 32 26 56 23 19 31 34 34 24 22 34 30
## [254] 32 40 29 21 33 33 38
fdims$elb.di
## [1] 11.2 12.1 11.3 12.3 11.5 12.5 12.3 13.3 12.1 13.4 11.8 12.8 12.8 10.6
## [15] 11.5 11.5 11.2 13.4 10.3 13.4 11.1 13.7 13.2 12.5 13.1 12.0 12.6 12.4
## [29] 12.6 12.4 11.3 11.9 13.2 11.8 11.3 12.3 12.7 11.5 12.9 12.7 12.4 13.1
## [43] 11.2 12.8 11.8 12.6 12.6 12.0 12.8 12.4 12.9 12.0 12.9 11.7 12.6 13.0
## [57] 13.4 12.4 12.4 13.0 12.4 13.4 11.5 13.8 14.2 10.9 11.5 12.6 11.5 13.4
## [71] 12.2 12.0 12.8 12.6 13.2 12.8 11.5 12.0 12.9 13.1 13.1 12.1 14.0 13.6
## [85] 12.0 12.6 11.3 10.1 12.2 11.9 12.4 11.8 11.9 11.6 12.0 12.1 12.4 11.1
## [99] 13.4 12.3 11.0 11.6 11.5 12.6 10.9 12.1 13.0 11.6 12.7 12.4 11.7 14.1
## [113] 10.7 13.2 11.8 10.8 12.9 11.5 10.6 11.6 11.5 12.8 11.3 13.1 12.0 12.4
## [127] 9.9 11.8 12.1 11.2 11.6 12.4 13.0 10.4 11.5 11.2 11.6 11.2 13.0 11.1
## [141] 12.3 12.5 12.3 12.0 11.0 11.0 11.9 11.6 12.0 13.9 12.6 13.0 12.6 13.0
## [155] 13.2 12.8 12.2 12.0 13.2 14.0 13.1 12.9 12.7 12.4 12.4 13.2 12.4 13.8
## [169] 13.1 12.4 11.8 13.1 14.0 12.7 12.4 12.4 12.9 12.4 12.8 12.4 13.8 11.8
## [183] 12.6 12.2 13.2 13.6 12.2 12.2 13.4 12.8 11.3 12.4 12.6 13.0 12.9 12.2
## [197] 13.4 12.0 13.4 13.4 13.6 12.2 12.9 12.0 12.8 11.2 12.9 11.6 14.3 13.8
## [211] 13.2 12.2 11.5 11.8 12.4 11.6 11.7 12.4 13.0 12.8 12.4 12.2 13.3 12.7
## [225] 13.2 11.8 15.0 12.0 13.2 12.4 12.4 12.7 12.9 11.4 13.0 12.9 11.6 11.0
## [239] 11.6 11.8 12.8 12.0 11.2 12.6 12.6 11.8 10.6 12.4 13.4 12.8 14.0 12.4
## [253] 13.4 13.1 13.3 12.9 12.4 12.0 12.0 13.4
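The effect of rounding can be reproduced with simulated data. A minimal sketch, assuming continuous measurements that are then recorded to one decimal place (the mean of 12.4 and sd of 0.9 are illustrative values chosen to resemble the elbow diameters printed above, not estimates from bdims):

```r
# Simulated continuous measurements (assumed mean and sd)
set.seed(6)
raw <- rnorm(260, mean = 12.4, sd = 0.9)

# The same measurements recorded to one decimal place
rounded <- round(raw, 1)

# Rounding collapses many observations onto the same value
n_distinct_raw <- length(unique(raw))
n_distinct_rounded <- length(unique(rounded))

# Each run of tied values forms one horizontal step in the Q-Q plot
largest_tie <- max(table(rounded))
print(c(raw = n_distinct_raw, rounded = n_distinct_rounded, largest_tie = largest_tie))
```

Calling `qqnorm(raw)` and `qqnorm(rounded)` one after the other should show a smooth band for the first and a staircase for the second.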
3. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
# Normal probability plot of female knee diameters
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)
The points deviate most strongly from the line in the right tail, which suggests that the distribution is more right skewed than we would expect under a normal distribution. The following histogram confirms it.
# Histogram of female knee diameters
hist(fdims$kne.di)
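Right skew can also be checked numerically: the long right tail pulls the mean above the median, and the sample skewness is positive. A minimal sketch on an idealized right-skewed sample built from evenly spaced exponential quantiles, so the result does not depend on a random seed (the offset of 16 and the rate of 0.5 are illustrative assumptions, not the kne.di measurements):

```r
# Idealized right-skewed sample: evenly spaced exponential quantiles plus an offset
# (illustrative values, not the kne.di measurements)
x <- 16 + qexp(ppoints(260), rate = 0.5)

# Right skew pulls the mean above the median
center_gap <- mean(x) - median(x)

# Sample skewness (third standardized moment); positive means right skew
z <- (x - mean(x)) / sd(x)
skewness <- mean(z^3)

# Histogram bin counts, computed without drawing the plot
counts <- hist(x, plot = FALSE)$counts
print(c(center_gap = center_gap, skewness = skewness))
print(counts)
```

Applying the same three checks to `fdims$kne.di` would give a numeric counterpart to the plots above.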