#Download and Load lab data
download.file("http://www.openintro.org/stat/data/bdims.RData",destfile="bdims.RData")
load("bdims.RData")
#View the first rows of bdims with the head function
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 1
## 2 17.0 23 71.8 175.3 1
## 3 16.9 28 80.7 193.5 1
## 4 16.6 23 72.6 186.5 1
## 5 18.0 22 78.8 187.2 1
## 6 16.9 21 74.8 181.5 1
#Create two data sets that will highlight male and female body dimensions
#0== Females, 1== Males
mdims <- subset(bdims,sex==1)
fdims <- subset(bdims,sex==0)
Exercise 1 Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
#Create a frequency histogram
#Male respondents height in centimeters
hist(mdims$hgt)
#Create a frequency histogram
#Female respondents height in centimeters
hist(fdims$hgt)
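To compare the two histograms directly, it can help to draw them on a shared horizontal axis with the same bin breaks. The sketch below uses simulated stand-in vectors: `m_hgt` and `f_hgt` are hypothetical names with assumed sample sizes, means and standard deviations, not the lab data. With the lab data, `mdims$hgt` and `fdims$hgt` would take their place.

```r
# Simulated stand-ins for the two height vectors (assumed values, not the lab data)
set.seed(1)
m_hgt <- rnorm(247, mean = 178, sd = 7)
f_hgt <- rnorm(260, mean = 165, sd = 6.5)

# Common breaks, 5 cm wide, that cover both samples so the bins line up
rng <- range(c(m_hgt, f_hgt))
breaks <- seq(floor(rng[1] / 5) * 5, ceiling(rng[2] / 5) * 5, by = 5)

# Stack the two histograms and give them the same x-axis
old_par <- par(mfrow = c(2, 1))
hm <- hist(m_hgt, breaks = breaks, xlim = range(breaks),
           main = "Simulated male heights", xlab = "Height (cm)")
hf <- hist(f_hgt, breaks = breaks, xlim = range(breaks),
           main = "Simulated female heights", xlab = "Height (cm)")
par(old_par)
```

With identical breaks, differences in center and spread can be read straight off the two panels instead of being hidden by different default bins.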
## The Normal Distribution
#Calculate the mean and standard deviation of the female heights
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
#Create a density histogram
#ylim was added to adjust the y-axis
hist(fdims$hgt, probability=TRUE, ylim = c(0,0.06))
#x runs from 140 to 190 in order to span the entire range of fdims$hgt
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
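The normal curve is a density, so it only lines up with the histogram when the bars are on the density scale as well. A small sketch of why `probability = TRUE` matters, assuming a simulated stand-in for the female heights (`hgt` is a hypothetical vector, not the lab data):

```r
# Simulated stand-in for fdims$hgt (assumed mean and sd, not the lab data)
set.seed(2)
hgt <- rnorm(260, mean = 165, sd = 6.5)

# Compute the histogram without drawing it
h <- hist(hgt, plot = FALSE)

# On the frequency scale the bar heights add up to the sample size
total_count <- sum(h$counts)
total_count

# On the density scale, bar height times bar width adds up to 1,
# which is the same total area as the curve drawn by dnorm
total_area <- sum(h$density * diff(h$breaks))
total_area
```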
Exercise 2 Based on this plot, does it appear that the data follow a nearly normal distribution?
## Evaluating the Normal Distribution
#Construct a normal probability plot
qqnorm(fdims$hgt)
qqline(fdims$hgt)
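A normal probability plot pairs the sorted data with the quantiles a standard normal sample of the same size would be expected to show. The sketch below rebuilds the plot by hand to make that concrete. It assumes a simulated stand-in for the heights (`hgt` is hypothetical, not the lab data) and uses only base R functions.

```r
# Simulated stand-in for fdims$hgt (assumed mean and sd, not the lab data)
set.seed(3)
hgt <- rnorm(260, mean = 165, sd = 6.5)

# qqnorm can return its coordinates without drawing anything
qq <- qqnorm(hgt, plot.it = FALSE)

# Theoretical quantiles: standard normal quantiles at evenly spread probabilities
theo <- qnorm(ppoints(length(hgt)))

# Rebuild the plot: sorted data against the theoretical quantiles
plot(theo, sort(hgt), xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# qqline passes through the first and third quartiles of data and theory
y_q <- quantile(hgt, c(0.25, 0.75), names = FALSE)
x_q <- qnorm(c(0.25, 0.75))
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]
abline(intercept, slope)
```

Points close to the line mean the sample quantiles grow in step with the normal quantiles. Curvature at either end signals tails that are heavier or lighter than normal.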
#Simulating data from a normal distribution using rnorm
#The first argument indicates how many numbers you'd like to generate
sim_norm <- rnorm(n = length(fdims$hgt), mean=fhgtmean, sd = fhgtsd)
sim_norm
## [1] 165.8131 151.8566 161.5643 172.5961 165.7614 165.7821 162.5958
## [8] 189.7133 160.6712 161.1966 155.5564 166.6187 172.6284 164.7766
## [15] 167.7175 169.9862 157.4507 160.1773 150.6689 162.9913 161.8897
## [22] 160.7753 179.9645 155.5262 165.7288 172.0940 174.7992 159.9108
## [29] 152.7347 164.5264 160.9478 162.9533 160.5788 162.1592 180.3940
## [36] 150.4101 162.7247 171.3134 172.7764 166.1660 160.1323 171.9837
## [43] 171.0969 160.8373 167.1484 161.0600 173.5440 178.4968 163.8935
## [50] 164.9446 166.0483 162.3548 166.0160 159.6928 155.1175 168.3452
## [57] 175.0874 175.4410 177.1257 169.1374 172.7429 168.3308 162.4806
## [64] 168.1360 169.2885 176.9760 157.0183 172.3023 169.7284 162.8829
## [71] 170.1119 165.7044 170.4226 172.5597 173.0741 162.6011 159.8827
## [78] 169.3787 171.6966 175.5617 177.3961 173.4887 167.1758 170.5135
## [85] 157.5500 153.4794 163.6136 159.8821 170.0958 164.7754 167.9542
## [92] 160.2178 173.3890 168.7132 160.3220 157.9092 175.8510 164.4651
## [99] 152.5431 168.9657 167.4640 160.1413 161.4527 160.5172 161.5267
## [106] 157.0555 175.0093 160.2179 159.8716 165.2765 169.5754 169.2259
## [113] 166.7408 169.2915 155.5436 154.6015 175.7631 168.0434 149.6165
## [120] 175.2523 157.6347 152.0417 160.1601 161.2652 153.2133 169.5443
## [127] 175.3070 168.1407 180.8209 162.1628 171.9400 164.9589 156.7603
## [134] 171.1374 157.7529 163.6914 169.8971 159.9008 169.9516 165.2991
## [141] 173.8664 167.5409 158.4825 158.7975 155.7927 171.8572 169.5572
## [148] 165.0311 173.9515 154.9940 165.1952 165.9736 162.1886 162.1989
## [155] 168.0663 164.9639 157.2935 162.4621 153.8261 165.2302 156.7391
## [162] 167.2553 160.3755 167.9766 160.4191 161.4031 157.1244 170.3051
## [169] 152.7439 166.9659 173.1414 156.0926 159.2371 157.9221 161.6258
## [176] 175.6358 167.3940 165.2967 165.4591 166.4530 153.1966 163.5657
## [183] 167.6130 171.1999 173.6548 157.6324 174.9019 163.5909 167.9106
## [190] 174.8859 154.8941 154.1470 163.2359 173.5930 164.2523 177.7288
## [197] 157.1159 161.9781 163.4201 169.0798 164.3797 164.7388 163.5262
## [204] 157.9272 155.0673 160.2536 162.5904 158.8576 166.7489 181.0100
## [211] 161.3308 173.4946 156.8486 168.4660 164.4907 159.2789 154.9288
## [218] 152.2163 169.9907 167.7897 176.5428 165.3494 168.0637 167.9616
## [225] 165.8748 160.9481 164.7105 165.1090 166.6054 159.5659 172.6671
## [232] 164.4620 162.3415 167.7929 161.9757 166.8746 166.7911 161.4705
## [239] 172.6237 153.9308 177.6588 166.8179 167.1943 165.6186 167.0684
## [246] 163.4051 166.2961 174.9279 170.1966 168.7432 173.8004 154.2147
## [253] 169.4443 161.8890 157.1156 168.6377 170.0611 164.6748 156.8898
## [260] 178.2617
Exercise 3 Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
qqnorm(sim_norm)
qqline(sim_norm)
Not all of the points fall on the line. The plot is similar to the normal probability plot for the real data, because in both plots the points stray from the line toward the tails.
#Compare it to many more plots
qqnormsim(fdims$hgt)
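`qqnormsim` is not part of base R. In this lab it appears to come from the loaded workspace. A rough sketch of what a helper of this kind might do, assuming it draws the normal probability plot of the data followed by eight plots of simulated normal samples in a 3-by-3 grid. The name `qqnorm_sim_sketch`, its arguments and its internals are hypothetical and are not the lab's own implementation.

```r
# Hypothetical stand-in for a qqnormsim-style helper (not the lab's own code)
qqnorm_sim_sketch <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(3, 3, 2, 1))
  on.exit(par(old_par))
  # First panel: the observed data
  qqnorm(dat, main = "Normal QQ Plot (Data)")
  qqline(dat)
  # Remaining panels: samples simulated from a normal with matching mean and sd
  sims <- replicate(n_sim, rnorm(length(dat), mean = mean(dat), sd = sd(dat)))
  for (i in seq_len(n_sim)) {
    qqnorm(sims[, i], main = "Normal QQ Plot (Sim)")
    qqline(sims[, i])
  }
  # Return the simulated samples invisibly, one column per panel
  invisible(sims)
}

# Usage with a simulated stand-in for the heights (not the lab data)
set.seed(4)
fake_hgt <- rnorm(260, mean = 165, sd = 6.5)
sims <- qqnorm_sim_sketch(fake_hgt)
```

The simulated panels show how much wobble to expect from data that really are normal, which gives a baseline for judging the panel with the real data.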
Exercise 4 Does the normal probability plot for fdims$hgt look similar to the plots created for this simulated data? That is, do plots provide evidence that the female heights are nearly normal?
The plot for fdims$hgt looks similar to Normal QQ Plot (Data) in row one and to Normal QQ Plot (Sim) in row three, column one. The plots therefore provide evidence that the female heights are nearly normal.
Exercise 5 Using the same technique, determine whether or not female weights appear to come from a normal distribution.
#Create a histogram that will highlight female weight
hist(fdims$wgt)
#Calculate the mean and standard deviation of female weight
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
#Create a density histogram
hist(fdims$wgt, probability = TRUE)
x <- 40:110
y <- dnorm(x = x, mean = fwgtmean, sd = fwgtsd)
lines(x = x, y = y, col = "blue")
#Construct a normal probability plot
qqnorm(fdims$wgt)
qqline(fdims$wgt)
#Compare plots
qqnormsim(fdims$wgt)
Judged by the patterns described in the week 3 slides, the female weights show a bit of right skew toward the tails, so they depart somewhat from a normal distribution.
#Theoretical probability that a randomly chosen woman is taller than 6 feet (about 182 cm)
#pnorm gives the area under the normal curve below a given value, q, with a given mean and standard deviation
#The call therefore returns P(height < 182); the probability of being taller is 1 - 0.9955656 = 0.0044344
pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.9955656
#Calculate the probability empirically
#Determine how many observations fall above 182 then divide this number by the total sample size
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
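The theoretical calculation can also be read in terms of a z-score, and because `pnorm` returns the area below `q`, an upper-tail probability needs the complement. A minimal sketch, assuming a simulated stand-in for the heights (`hgt`, `m` and `s` are hypothetical names, not the lab objects):

```r
# Simulated stand-in for fdims$hgt (assumed mean and sd, not the lab data)
set.seed(5)
hgt <- rnorm(260, mean = 165, sd = 6.5)
m <- mean(hgt)
s <- sd(hgt)
q <- 182

# z-score: how many standard deviations q lies above the mean
z <- (q - m) / s

# Area below q, computed directly and through the z-score
p_below <- pnorm(q, mean = m, sd = s)
p_below_z <- pnorm(z)

# Area above q: the complement, or equivalently lower.tail = FALSE
p_above <- 1 - p_below
p_above_alt <- pnorm(q, mean = m, sd = s, lower.tail = FALSE)

# Empirical counterpart: the share of observations above q
p_above_emp <- mean(hgt > q)
```

`mean()` of a logical vector is the proportion of `TRUE` values, so it gives the same result as dividing `sum(hgt > q)` by `length(hgt)`.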
Exercise 6 Write out two probability questions that you would like to answer, one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
#Probability that a randomly chosen woman is shorter than 5 feet (152.4 cm)
#Theoretical Normal Distribution
pnorm(q = 152.4, mean = fhgtmean, sd = fhgtsd)
## [1] 0.028342
#Empirical Distribution
#Count the heights below 152.4, then divide the count by the sample size
sum(fdims$hgt < 152.4) / length(fdims$hgt)
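#Probability that a randomly chosen woman weighs more than 150 pounds (68 kg)
#Theoretical Normal Distribution
#pnorm returns the area below q, so P(weight > 68) is 1 - 0.7792121 = 0.2207879
pnorm(q = 68, mean = fwgtmean, sd = fwgtsd)
## [1] 0.7792121
#Empirical Distribution
#Count the weights above 68, then divide the count by the sample size
sum(fdims$wgt > 68) / length(fdims$wgt)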
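In R, `/` binds more tightly than `<` and `>`, so the position of the closing parenthesis decides whether the count or the cutoff is divided by the sample size. A toy illustration with a made-up vector (`x` is hypothetical, not the lab data):

```r
# Four made-up heights in cm
x <- c(150, 160, 170, 180)

# Parenthesis closed too late: the cutoff becomes 165 / 4 = 41.25,
# no value is below it, and the sum of the comparisons is a count
wrong <- sum(x < 165 / length(x))
wrong  # 0

# Parenthesis closed after the comparison: count first, then divide
right <- sum(x < 165) / length(x)
right  # 0.5

# mean() of the logical vector gives the same proportion
same <- mean(x < 165)
same   # 0.5
```

A result of exactly 0, or one equal to the full sample size, is a useful warning sign that a count rather than a proportion was computed.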
1A. (Plot B)
#Histogram for female biiliac (pelvic) diameter (bii.di)
hist(fdims$bii.di)
#Normal Probability Plot for female biiliac (pelvic) diameter (bii.di)
qqnorm(fdims$bii.di)
qqline(fdims$bii.di)
1B. (Plot C)
#Histogram for female elbow diameter (elb.di)
hist(fdims$elb.di)
#Normal Probability Plot for female elbow diameter (elb.di)
qqnorm(fdims$elb.di)
qqline(fdims$elb.di)
1C. (Plot D)
#Histogram for female age (age)
hist(fdims$age)
#Normal Probability Plot for female age (age)
qqnorm(fdims$age)
qqline(fdims$age)
1D. (Plot A)
#Histogram for female chest depth (che.de)
hist(fdims$che.de)
#Normal Probability Plot for female chest depth (che.de)
qqnorm(fdims$che.de)
qqline(fdims$che.de)
2. Note that the normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?
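In cases like this, it usually comes down to how the variable is recorded. Age is recorded in whole years and elbow diameter to the nearest 0.1 cm, so the data take only a limited set of distinct values. Many observations share the same value, and each group of tied values plots as a horizontal run of points, which produces the steps.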
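One way a stepwise pattern can arise is rounding: when a continuous quantity is recorded to a coarse precision, many observations tie. A small simulation sketch, assuming made-up values (`raw` and `rounded` are hypothetical vectors, not the lab data):

```r
# Simulated continuous values, then the same values rounded to whole numbers
set.seed(6)
raw <- rnorm(260, mean = 30, sd = 8)
rounded <- round(raw)

# Rounding collapses the sample onto far fewer distinct values
n_raw <- length(unique(raw))
n_rounded <- length(unique(rounded))
c(n_raw, n_rounded)

# Side by side: a smooth trail of points versus flat steps of tied values
old_par <- par(mfrow = c(1, 2))
qqnorm(raw, main = "Unrounded")
qqline(raw)
qqnorm(rounded, main = "Rounded to whole numbers")
qqline(rounded)
par(old_par)
```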
qqnorm(fdims$kne.di)
qqline(fdims$kne.di)
The distribution of female knee diameter (kne.di) is right skewed.
hist(fdims$kne.di)