The Data

This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.

body_dim <- read.csv("bdims.csv")
head(body_dim)
##   bia_di bii_di bit_di che_de che_di elb_di wri_di kne_di ank_di sho_gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che_gi wai_gi nav_gi hip_gi thi_gi bic_gi for_gi kne_gi cal_gi ank_gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri_gi age  wgt   hgt sex
## 1   16.5  21 65.6 174.0   1
## 2   17.0  23 71.8 175.3   1
## 3   16.9  28 80.7 193.5   1
## 4   16.6  23 72.6 186.5   1
## 5   18.0  22 78.8 187.2   1
## 6   16.9  21 74.8 181.5   1

Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.

male_dims <- subset(body_dim, sex == 1)
female_dims <- subset(body_dim, sex == 0)
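
A quick sanity check after subsetting is to confirm that the two pieces add back up to the whole. The sketch below uses a small made-up data frame (`toy_dim` and its values are hypothetical, chosen only to mimic the `sex` coding used above) rather than the real file:

```r
# Toy data frame using the same coding as above: sex == 1 for men, 0 for women.
# The rows are invented for illustration; they are not taken from bdims.csv.
toy_dim <- data.frame(
  sex = c(1, 1, 0, 0, 0),
  hgt = c(174.0, 175.3, 160.2, 165.1, 170.0)
)

toy_male   <- subset(toy_dim, sex == 1)
toy_female <- subset(toy_dim, sex == 0)

# Counts per group, and a check that no rows were lost or duplicated
table(toy_dim$sex)
nrow(toy_male) + nrow(toy_female) == nrow(toy_dim)
```

Run against the real data, `table(body_dim$sex)` should report the 247 men and 260 women described at the start of this section.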

Exercise 1: Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

#men's heights
hist(male_dims$hgt,
     breaks = 10,
     xlab = "Height",
     main = "Histogram of Men's heights",
     col = "blue",
     probability = TRUE)
x <- male_dims$hgt
xfit <- seq(min(x)-5, max(x)+5, length = 200)
yfit <- dnorm(xfit, mean(x), sd(x))
lines(xfit, yfit, col = "red", lwd = 2)

#summary of men's heights
summary(male_dims$hgt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   157.2   172.9   177.8   177.7   182.7   198.1
#women's heights
hist(female_dims$hgt,
     breaks = 10,
     xlab = "Height",
     main = "Histogram of women's heights",
     col = "pink",
     probability = TRUE)
x <- female_dims$hgt
xfit <- seq(min(x)-5, max(x)+5, length = 200)
yfit <- dnorm(xfit, mean(x), sd(x))
lines(xfit, yfit, col = "blue", lwd = 2)

#summary of women's heights
summary(female_dims$hgt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   147.2   160.0   164.5   164.9   169.5   182.9

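Both histograms look approximately normal: each is unimodal and roughly symmetric around its mean. The distributions differ mainly in their centers: the mean male height (177.7 cm) is about 13 cm higher than the mean female height (164.9 cm).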

The normal distribution

hist(female_dims$hgt, probability = TRUE, ylim = c(0, 0.1), main = "Histogram of female heights", xlab = "Female Heights")
x <- female_dims$hgt
xfit <- seq(min(x)-5, max(x)+5, length = 200)
yfit <- dnorm(xfit, mean(x), sd(x))
lines(xfit, yfit, col = "blue", lwd = 2)
abline(v = mean(x), col = "red")

Exercise 2: Based on this plot, does it appear that the data follow a nearly normal distribution?

Although the curve does not fit the histogram exactly, both are roughly symmetric and centered near the same point, so the data do appear to follow a nearly normal distribution.

Evaluating the normal distribution

qqnorm(female_dims$hgt)
qqline(female_dims$hgt)

sim_norm <- rnorm(n = length(female_dims$hgt), mean = mean(female_dims$hgt), sd = sd(female_dims$hgt))
sim_norm
##   [1] 171.7533 172.8630 168.5978 158.0601 171.9240 167.7843 175.8042
##   [8] 157.4392 159.8249 168.7719 163.1065 173.3008 168.1577 147.3689
##  [15] 169.1575 156.7638 170.6536 174.0842 171.8990 151.5477 165.1810
##  [22] 167.3593 167.1150 170.2172 163.8685 179.5994 159.9916 173.4141
##  [29] 158.4794 161.8544 159.1807 159.3132 160.4531 169.5200 164.7910
##  [36] 163.9928 157.9089 159.1050 146.4558 175.9130 158.0167 163.1168
##  [43] 165.1922 174.7304 162.0949 162.7623 157.7561 161.8615 165.9791
##  [50] 162.5310 173.9068 172.5563 168.4724 169.3918 164.6970 152.4766
##  [57] 169.6992 159.5458 163.2476 166.7321 167.0166 153.3052 160.2097
##  [64] 165.4128 161.7460 163.3077 169.7757 158.2919 159.1127 166.5790
##  [71] 175.8100 165.8324 161.6232 156.3538 176.3340 170.8443 168.0204
##  [78] 174.6352 168.2202 163.0485 164.1081 178.3102 151.2851 160.4238
##  [85] 161.1405 175.1160 170.3037 165.3845 164.6686 163.0187 161.3210
##  [92] 160.1541 161.1009 165.0528 161.8597 167.3087 165.8517 161.9555
##  [99] 157.9496 161.4734 159.5046 166.1024 174.2957 159.6604 160.7427
## [106] 154.9671 168.8063 174.1037 168.7858 167.9974 170.7770 173.9208
## [113] 158.7378 161.7192 168.7496 174.9693 161.9662 162.9366 170.3492
## [120] 167.7067 147.2342 167.7121 167.8513 160.7974 180.4553 176.7887
## [127] 161.3457 171.0351 161.6317 167.4560 161.6269 164.3806 161.1528
## [134] 169.8029 163.0032 166.3441 152.2301 168.7879 171.9590 161.1942
## [141] 152.5543 155.3259 162.4303 156.8581 158.3281 166.6881 165.1458
## [148] 192.0515 167.2153 163.4158 161.8127 165.0578 171.3266 164.2662
## [155] 162.9555 170.2349 176.7625 162.3735 171.1872 173.5057 165.0197
## [162] 170.6775 169.0944 159.7394 167.8367 164.2715 161.0428 168.9819
## [169] 161.5120 162.5395 155.2980 163.2091 164.7805 162.4444 176.8249
## [176] 158.4045 164.3436 159.3891 169.4190 164.2456 160.6794 159.8888
## [183] 173.5962 165.8643 156.7777 164.7504 170.1018 170.4869 151.7153
## [190] 175.6011 160.8333 156.7898 157.8890 167.9886 167.4680 175.5777
## [197] 154.2576 168.0402 175.4997 160.7848 170.5853 169.1589 165.8205
## [204] 162.3781 151.1336 173.7855 170.3454 150.2041 169.6465 175.6572
## [211] 174.3770 162.6144 164.3360 167.9169 166.0498 158.6823 160.3131
## [218] 156.0749 163.1476 162.4013 160.8811 173.4538 159.2656 155.7479
## [225] 155.9755 160.6640 165.5417 174.8941 175.3290 176.1331 172.9595
## [232] 177.3140 154.2776 178.3951 161.1282 157.1781 153.6836 177.3439
## [239] 166.9600 166.5840 157.2265 165.1957 157.9690 163.7599 165.2916
## [246] 167.9095 164.5104 172.6602 170.8365 165.3277 153.6182 171.2932
## [253] 164.5434 166.8918 156.1617 158.8777 167.0882 174.1882 168.4806
## [260] 171.6876

Exercise 3: Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

qqnorm(sim_norm)
qqline(sim_norm)

Nearly all of the points for sim_norm fall very close to the theoretical line, except for a few at the end of each tail. The plot for the real data has similar characteristics, except that the points in the tails deviate from the line in the opposite direction from those in the simulated data.

Exercise 4: Does the normal probability plot for female_dims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?

library(StMoSim)
qqnormSim(female_dims$hgt)

It is very similar to the simulated data. We can conclude that female heights are nearly normal.
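
The same comparison can be sketched in base R by drawing the observed normal probability plot next to several plots of freshly simulated normal samples. This only illustrates the general idea of judging the real plot against simulated ones; it is not a description of how qqnormSim works internally. The data here are simulated stand-ins (the mean of 165 and standard deviation of 6.5 are assumptions, not the real values):

```r
# Simulated stand-in for the observed heights (assumed mean and sd)
set.seed(1)
obs <- rnorm(260, mean = 165, sd = 6.5)

# Eight fresh normal samples with the same size, mean, and sd as `obs`
sims <- replicate(8, rnorm(length(obs), mean(obs), sd(obs)), simplify = FALSE)

# 3 x 3 grid: the observed plot first, then the simulated ones
op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
qqnorm(obs, main = "Observed")
qqline(obs)
for (i in seq_along(sims)) {
  qqnorm(sims[[i]], main = paste("Sim", i))
  qqline(sims[[i]])
}
par(op)  # restore the previous plotting layout
```

If the observed panel cannot be picked out from the simulated ones, the data are plausibly nearly normal. Replacing `obs` with `female_dims$hgt` would apply the check to the real heights.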

Exercise 5: Using the same technique, determine whether or not female weights appear to come from a normal distribution.

# Normal probability plot for the real data
qqnorm(female_dims$wgt)
qqline(female_dims$wgt)

# Simulated data plot for female weights
qqnormSim(female_dims$wgt)

In these plots both tails pull away from the line, and the points follow a more curved pattern than they did for height. This suggests that female weights may not be nearly normal.

Normal probabilities

1 - pnorm(q = 182, mean = mean(female_dims$hgt), sd = sd(female_dims$hgt))
## [1] 0.004434387
# Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
sum(female_dims$hgt >182)/length(female_dims$hgt)
## [1] 0.003846154

Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.
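
The two calculations can be wrapped in a small helper to make the comparison repeatable. This is a minimal sketch: `compare_upper_tail` is a hypothetical function name, and both samples are simulated (the parameters are assumptions) so that the block runs without the data file:

```r
# Hypothetical helper: normal-model tail probability vs. observed proportion
compare_upper_tail <- function(x, q) {
  theoretical <- 1 - pnorm(q, mean = mean(x), sd = sd(x))
  empirical   <- mean(x > q)  # equivalent to sum(x > q) / length(x)
  c(theoretical = theoretical,
    empirical   = empirical,
    abs_diff    = abs(theoretical - empirical))
}

set.seed(42)
normal_sample <- rnorm(260, mean = 165, sd = 6.5)  # simulated, nearly normal
skewed_sample <- rexp(260, rate = 1 / 60)          # simulated, right skewed

compare_upper_tail(normal_sample, q = 182)
compare_upper_tail(skewed_sample, q = 120)
```

Calling `compare_upper_tail(female_dims$hgt, q = 182)` on the real data should reproduce the two probabilities computed above.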

Exercise 6: Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?

“What is the probability that a randomly chosen young adult female is shorter than 165 cm?”

pnorm(q = 165, mean = mean(female_dims$hgt), sd = sd(female_dims$hgt))
## [1] 0.5077833
sum(female_dims$hgt<165)/length(female_dims$hgt)
## [1] 0.5076923

Both answers are very close, at just over 50%.

“What is the probability that a randomly chosen young adult female is taller than 175 cm?”

1-pnorm(q = 175, mean = mean(female_dims$hgt), sd = sd(female_dims$hgt))
## [1] 0.06087283
sum(female_dims$hgt>175)/length(female_dims$hgt)
## [1] 0.07692308

Again, the two answers are reasonably close for female heights (about 0.061 versus 0.077).

“What is the probability that a randomly chosen young adult female weighs more than 50 kg?”

1-pnorm(q = 50, mean = mean(female_dims$wgt), sd = sd(female_dims$wgt))
## [1] 0.864857
sum(female_dims$wgt>50)/length(female_dims$wgt)
## [1] 0.8807692

The two answers are close for the probability of a female weight over 50 kg (about 0.865 versus 0.881).

“What is the probability that a randomly chosen young adult female weighs less than 75 kg?”

pnorm(q = 75, mean = mean(female_dims$wgt), sd = sd(female_dims$wgt))
## [1] 0.9328698
sum(female_dims$wgt<75)/length(female_dims$wgt)
## [1] 0.9269231

Once again, the two answers are similar for the probability of a female weight under 75 kg.
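
To see which variable agrees more closely, the four pairs of probabilities reported above can be collected in one table and their gaps compared. The numbers below are copied from the output in this section; the data frame and column names are only illustrative:

```r
# Theoretical vs. empirical probabilities, copied from the output above
results <- data.frame(
  variable    = c("hgt", "hgt", "wgt", "wgt"),
  question    = c("< 165 cm", "> 175 cm", "> 50 kg", "< 75 kg"),
  theoretical = c(0.5077833, 0.06087283, 0.864857, 0.9328698),
  empirical   = c(0.5076923, 0.07692308, 0.8807692, 0.9269231)
)

# Absolute gap between the two methods for each question
results$abs_diff <- abs(results$theoretical - results$empirical)
results

# Average gap per variable
gap_by_variable <- tapply(results$abs_diff, results$variable, mean)
gap_by_variable
```

For these four questions the average gap is roughly 0.008 for height and 0.011 for weight, so height agrees slightly better. The margin is small and depends on the cutoffs chosen, so it is weak evidence on its own.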

On Your Own

1. Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

  1. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter B.
  1. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter C.
  1. The histogram for general age (age) belongs to normal probability plot letter D.
  1. The histogram for female chest depth (che.de) belongs to normal probability plot letter A.
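
The standardization mentioned in the question (subtract the mean, then divide by the standard deviation) can be sketched as follows. The values are simulated stand-ins with assumed parameters, not real measurements:

```r
# Simulated stand-in for one body measurement (assumed mean and sd)
set.seed(7)
x <- rnorm(260, mean = 27.5, sd = 1.5)

# Standardize: subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

c(mean = mean(z), sd = sd(z))        # approximately 0 and exactly-scaled 1
all.equal(z, as.numeric(scale(x)))   # base R's scale() does the same thing

# Standardizing is a linear transformation, so the ordering of the points
# and the shape of the normal probability plot are unchanged
cor(sort(x), sort(z))
```

Because only the axis units change, the standardized plots can still be matched to their histograms by shape alone.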

2. Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

Stepwise patterns in a normal probability plot are usually produced by discrete variables such as age, which take many repeated values, whereas continuous variables produce smooth patterns.
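
The effect can be reproduced by rounding a continuous sample. In this sketch the values are simulated (the mean of 30 and standard deviation of 9 are assumptions), and rounding to whole units stands in for recording age in years:

```r
set.seed(3)
smooth_x  <- rnorm(260, mean = 30, sd = 9)  # simulated continuous values
rounded_x <- round(smooth_x)                # same values recorded in whole units

# Rounding collapses many distinct values into a few repeated ones
length(unique(smooth_x))
length(unique(rounded_x))

# Side by side: the rounded sample shows horizontal steps of tied values
op <- par(mfrow = c(1, 2))
qqnorm(smooth_x, main = "Continuous")
qqline(smooth_x)
qqnorm(rounded_x, main = "Rounded")
qqline(rounded_x)
par(op)  # restore the previous plotting layout
```

Each step in the right-hand plot is a run of observations that share the same recorded value.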

# Plot for (bii.di)
qqnorm(female_dims$bii_di)
qqline(female_dims$bii_di)

# Plot for (elb.di)
qqnorm(female_dims$elb_di)
qqline(female_dims$elb_di)

# Plot for (age)
qqnorm(female_dims$age)
qqline(female_dims$age)

# Plot for (che.de)
qqnorm(female_dims$che_de)
qqline(female_dims$che_de)

3. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

qqnorm(female_dims$kne_di)
qqline(female_dims$kne_di)

Based on the normal probability plot, the data appear to be right skewed.

hist(female_dims$kne_di, main = "Histogram of female Knee Diameter", xlab = "Female Knee Diameter")

The histogram confirms that the data are skewed to the right.
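
For reference, here is what right skew looks like on both kinds of plot when the skew is known in advance. The sample is simulated from an exponential distribution, chosen only because it is clearly right skewed; it is not meant to model knee diameter:

```r
set.seed(11)
skewed <- rexp(260, rate = 1)  # simulated, strongly right-skewed sample

# With right skew, the long upper tail pulls the mean above the median
c(mean = mean(skewed), median = median(skewed))

op <- par(mfrow = c(1, 2))
# The points bend upward, sitting above the line in both tails
qqnorm(skewed, main = "Right-skewed sample")
qqline(skewed)
# The histogram shows the long tail stretching to the right
hist(skewed, main = "Histogram", xlab = "Simulated value")
par(op)  # restore the previous plotting layout
```

A left-skewed variable would show the mirror image: points falling below the line in both tails and a histogram with its long tail on the left.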