Exercise 3.4.1 - Suppose a certain population of observations is normally distributed. What percentage of the observations in the population

#(a) are within 1.5 standard deviations of the mean? #ans: on the standardized scale a normal distribution has mean = 0 and sd = 1, so use the pnorm fx to calculate the AUC below +1.5 sd and below -1.5 sd and then take the difference

(pnorm(1.5,0,1)-pnorm(-1.5,0,1))*100
## [1] 86.63856

##interpretation: 86.6% of observations are within 1.5 sd of the mean

#(b) are more than 2.5 standard deviations above the mean? #Ans: here we want only the upper tail, the area to the right of +2.5 sd, so use the complement 1 - Pr{Z <= 2.5}

(1-pnorm(2.5,0,1))*100
## [1] 0.6209665

##interpretation: 0.62% of observations are more than 2.5 sd above the mean

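#(c) are more than 3.5 standard deviations away from (above or below) the mean? #Ans: here we want both tails, the area below -3.5 sd plus the area above +3.5 sd; by symmetry this is twice the lower tail

(2*pnorm(-3.5,0,1))*100
## [1] 0.04652582

##interpretation: 0.047% of observations are more than 3.5 sd away from the mean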
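#sketch: "within", "above" and "away from" are three different areas under the standard normal curve. A minimal way to compare them side by side for any number of sd k, assuming a hypothetical helper named normal_areas (not from the textbook):

```r
# Percentages of a normal population relative to k standard deviations.
# normal_areas is a hypothetical helper name used only for this sketch.
normal_areas <- function(k) {
  c(
    within = (pnorm(k) - pnorm(-k)) * 100,         # between -k and +k sd
    above  = pnorm(k, lower.tail = FALSE) * 100,   # upper tail only, beyond +k sd
    away   = 2 * pnorm(-k) * 100                   # both tails, beyond -k or +k sd
  )
}

round(normal_areas(1.5), 2)
round(normal_areas(2.5), 2)
round(normal_areas(3.5), 3)
```

#note: "within" and "away" always add up to 100, which is a quick check on any answer of this type.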

Exercise 4.3.2 (a) The 90th percentile of a normal distribution is how many standard deviations above the mean? (b) The 10th percentile of a normal distribution is how many standard deviations below the mean?

#a) here we are given a % or proportion of 90%, so we will use the qnorm fx to calculate the z score, which on the standardized scale is measured in sd.

qnorm(.9,0,1)
## [1] 1.281552

##interpretation: the 90th percentile of a normal distribution is 1.28 sd above the mean

#b) again we are given a % or proportion, this time 10%, so we will use the qnorm fx to calculate the z score, which on the standardized scale is measured in sd

qnorm(.1,0,1)
## [1] -1.281552

##interpretation: the 10th percentile of a normal distribution is 1.28 sd below the mean (z = -1.28)
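#sketch: because the normal curve is symmetric about its mean, the 10th and 90th percentiles are mirror images. A short check, with z90 and z10 as names chosen for this sketch:

```r
# Symmetry of standard normal quantiles
z90 <- qnorm(0.9)   # 90th percentile on the standardized scale
z10 <- qnorm(0.1)   # 10th percentile on the standardized scale

all.equal(z10, -z90)            # same distance from the mean, opposite sign

# The same cutoff as z90, asked for through the upper tail instead
qnorm(0.1, lower.tail = FALSE)
```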

4.3.9 The serum cholesterol levels of 12- to 14-year-olds follow a normal distribution with mean 155 mg/dl and standard deviation 27 mg/dl. What percentage of 12 to 14-year-olds have serum cholesterol values

#(a) 164 or more? calculate the area to the right of 164 using the complement 1 - Pr{Y < 164}

(1-pnorm(164,155,27))*100
## [1] 36.94413

##interpretation: 36.9% of 12-14 yr olds have serum cholesterol >= 164 mg/dl

#(b) 137 or less? here calculate the area to the left of 137, Pr{Y <= 137}

(pnorm(137,155,27))*100
## [1] 25.24925

##interpretation: 25.2% of 12-14 yr olds have serum cholesterol <= 137 mg/dl

#(c) 186 or less? here calculate the area to the left of 186, Pr{Y <= 186}

(pnorm(186,155,27))*100
## [1] 87.45463

##interpretation: 87.5% of 12-14 yr olds have serum cholesterol <= 186 mg/dl

#(d) 100 or more? here calculate the area to the right of 100 using the complement 1 - Pr{Y < 100}

(1-pnorm(100,155,27))*100
## [1] 97.91768

##interpretation: 97.9% of 12-14 yr olds have serum cholesterol >= 100 mg/dl

#(e) between 159 and 186? here calculate the difference Pr{Y<186} - Pr{Y<159}, which is Pr{159<Y<186}

(pnorm(186,155,27)-pnorm(159,155,27))*100
## [1] 31.56592

##interpretation: 31.6% of 12-14 yr olds have serum cholesterol between 159 and 186 mg/dl

#(f) between 100 and 132? calculate the difference between Pr{100<Y<132}

(pnorm(132,155,27)-pnorm(100,155,27))*100
## [1] 17.6325

##interpretation: 17.6% of 12-14 yr olds have serum cholesterol between 100 and 132 mg/dl

#(g) between 132 and 159? calculate the difference between Pr{132<Y<159}

(pnorm(159,155,27)-pnorm(132,155,27))*100
## [1] 36.17389

##interpretation: 36.2% of 12-14 yr olds have serum cholesterol between 132 and 159 mg/dl
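#sketch: the cutoffs 100, 132, 159 and 186 split the cholesterol distribution into five pieces that must add up to 100%, which gives a check on parts (d) through (g). The names mu, sigma, cuts and pieces are chosen for this sketch only:

```r
# Slice the N(155, 27) distribution at the cutoffs used in parts (d) to (g)
mu <- 155
sigma <- 27
cuts <- c(-Inf, 100, 132, 159, 186, Inf)

# diff() of the cumulative areas gives the area of each slice, in percent
pieces <- diff(pnorm(cuts, mu, sigma)) * 100
names(pieces) <- c("<100", "100-132", "132-159", "159-186", ">186")

round(pieces, 2)
sum(pieces)          # the slices cover the whole distribution

# Standardizing by hand gives the same answer as passing mean and sd to pnorm
z <- (164 - mu) / sigma
pnorm(z, lower.tail = FALSE) * 100
```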

Exercise 4.3.10 Refer to Exercise 4.3.9. Suppose a 13-year-old is chosen at random and let Y be the person’s serum cholesterol value. Recall that the serum cholesterol levels of 12- to 14-year-olds follow a normal distribution with mean 155 mg/dl and standard deviation 27 mg/dl.

#Find #(a) Pr{Y >= 159}

(1-pnorm(159,155,27))*100
## [1] 44.11129

##interpretation - there is a 44.1% chance that a randomly chosen subject will have a cholesterol level of 159 mg/dl or more

#(b) Pr{159 < Y < 186}

(pnorm(186,155,27)-pnorm(159,155,27))*100
## [1] 31.56592

##interpretation - there is a 31.6% chance that a randomly chosen subject will have a cholesterol level between 159 and 186 mg/dl

Exercise 4.3.11 For the serum cholesterol distribution of Exercise 4.3.9, find (a) the 80th percentile (b) the 20th percentile

#a) find the 80th percentile - here we use the qnorm fx

qnorm(.8,155,27)
## [1] 177.7238

##interpretation: the 80th percentile value is 177.72 mg/dl of cholesterol

#b) find the 20th percentile. use qnorm

qnorm(.2,155,27)
## [1] 132.2762

##interpretation: the 20th percentile value is 132.28 mg/dl of cholesterol
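#sketch: a percentile on the mg/dl scale is the mean plus the z score times the sd, so the qnorm answers can be rebuilt from the standard normal and checked with pnorm. The names p80 and p20 are chosen for this sketch only:

```r
# Rebuild the percentiles from standard normal z scores: y = mu + z * sigma
mu <- 155
sigma <- 27
p80 <- mu + qnorm(0.8) * sigma
p20 <- mu + qnorm(0.2) * sigma
c(p20 = p20, p80 = p80)

# Round trip: the area to the left of each percentile returns 0.2 and 0.8
pnorm(c(p20, p80), mu, sigma)

# By symmetry the two percentiles are equally far from the mean
(p20 + p80) / 2
```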

Exercise 4.3.17 Many cities sponsor marathons each year. The following histogram shows the distribution of times that it took for 10,002 runners to complete the Rome marathon in 2008, with a normal curve superimposed. The fastest runner completed the 26.2-mile course in 2 hours and 9 minutes, or 129 minutes. The average time was 245 minutes and the standard deviation was 40 minutes. Use the normal curve to answer the following questions.

#(a) What percentage of times were greater than 200 minutes? use the pnorm fx and the complement 1 - Pr{Y <= 200}

(1-pnorm(200,245,40))*100
## [1] 86.97055

##interpretation: 87.0% of runner times were greater than 200 minutes

#(b) What is the 60th percentile of the times? use the qnorm fx

qnorm(.6,245,40)
## [1] 255.1339

##interpretation: the 60th percentile for this dataset is 255.13 minutes
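#sketch: the normal curve is only an approximation to the histogram. One way to see this, using only the numbers given in the exercise, is to ask how many of the 10,002 runners the curve would place below the fastest observed time of 129 minutes. The variable names are chosen for this sketch only:

```r
# How far out is the fastest time, and how many runners would a
# N(245, 40) curve put below it?
n_runners <- 10002
mu <- 245
sigma <- 40

z_fastest <- (129 - mu) / sigma                      # fastest time on the z scale
expected_faster <- n_runners * pnorm(129, mu, sigma) # runners the curve puts below 129

z_fastest
expected_faster
```

#note: the observed count below 129 minutes is zero by definition of "fastest", so any clearly positive value here shows the left tail of the curve extending past the real data.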

#(c) Notice that the normal curve approximation is fairly good except around the 240-minute mark. How can we explain this anomalous behavior of the distribution? #ANS: one possible explanation involves the impact of age on finishing time. N. Lehto found, in a study of the 312,342 runners who ran the Stockholm Marathon from 1979 to 2014, that age affected performance. The study fitted the relationship with a 2nd-order polynomial, t = a + bx + cx^2. A histogram of age distribution on page 352 of the study looks similar to the distribution given in 4.3.17 c). Lehto’s histogram (fig. 4) of time vs. number of finishers for all 40-year-olds shows a clear change at approximately 240 minutes, similar to 4.3.17. Lehto concluded: "The current investigation indicates that there exists an age when the physiology gives peak endurance performance in the marathon. With a level of confidence at 95%, this age was found to be 34.3 ± 2.6 years using the whole sample of male Stockholm Marathon finishers." I will post the paper to Discord.

Exercise 4.4.2 The following three normal quantile plots, (a), (b), and (c), were generated from the distributions shown by histograms I, II, and III. Which normal quantile plot goes with which histogram? How do you know?

##Ans: QQ plot a) goes with hist I. Why: from Samuels world ed., pg 147, if the top of the plot bends up, then the y values at the upper end of the distribution are too large for the distribution to be bell-shaped; that is, the distribution is skewed to the right or has large outliers. Hist I is skewed to the right (long right tail), so plot a) and hist I match most closely.

#plot b) goes with hist III - from Samuels world ed., pg 147, if the bottom of the plot bends down, then y values at the lower end of the distribution are too small for the distribution to be bell-shaped; that is, the distribution is skewed to the left or has small outliers. Hist III is skewed to the left (long left tail), so plot b) and hist III match most closely.

#plot c) goes with hist II - QQ plot c) is linear without much change in the tails. Hist II has minimal skew. Therefore plot c) and hist II most closely match.
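#sketch: the bending rules quoted from Samuels can be seen on simulated data. This uses made-up samples, not the textbook's data: an exponential sample for right skew, its negative for left skew, and a normal sample for the bell-shaped case. The list name samples is chosen for this sketch only:

```r
# Normal quantile plots for three simulated shapes
set.seed(1)  # fixed seed so the plots are reproducible
samples <- list(
  right_skewed = rexp(200),    # long right tail: top of the plot bends up
  left_skewed  = -rexp(200),   # long left tail: bottom of the plot bends down
  bell_shaped  = rnorm(200)    # roughly a straight line
)

op <- par(mfrow = c(1, 3))     # three panels side by side
for (nm in names(samples)) {
  qqnorm(samples[[nm]], main = nm)
  qqline(samples[[nm]])        # reference line through the quartiles
}
par(op)                        # restore the previous plot layout
```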

Exercise 4.4.6 The following normal quantile plot was created from the times that it took 166 bicycle riders to complete the stage 11 time trial, from Grenoble to Chamrousse, France, in the 2001 Tour de France cycling race.

#(a) Consider the fastest riders. Are their times better than, worse than, or roughly equal to the times one would expect the fastest riders to have if the data came from a truly normal distribution? #Answer - the fastest riders have the smallest times, which sit at the lower (negative z) end of the plot. That end of the plot looks linear, therefore the fastest riders' times are about equal to the times expected if the data came from a normal distribution.

#(b) Consider the slowest riders. Are their times better than, worse than, or roughly equal to the times one would expect the slowest riders to have if the data came from a truly normal distribution? #Answer - the slowest riders have the largest times, which sit at the upper end of the plot. Starting at approximately z = +1.5 there is an upward curve to the plot. This change indicates right skew or the pull of large outliers in the data. Therefore I believe the slowest riders' times are larger (worse) than expected if the distribution were normal.

Exercise 4.S.16 Resting heart rate was measured for a group of subjects; the subjects then drank 6 ounces of coffee. Ten minutes later their heart rates were measured again. The change in heart rate followed a normal distribution, with a mean increase of 7.3 beats per minute and a standard deviation of 11.1. Let Y denote the change in heart rate for a randomly selected person.

#Find #(a) Pr{Y>10}, find the area to the right of 10 b/m

(1-pnorm(10,7.3,11.1))*100
## [1] 40.39085

##interpretation: there is a 40.4% chance that the subject’s change in heart rate after caffeine use is > 10 b/m

#(b) Pr{Y>20}, find the area to the right of 20 b/m

(1-pnorm(20,7.3,11.1))*100
## [1] 12.62819

##interpretation: there is a 12.6% chance that the subject’s change in heart rate after caffeine use is > 20 b/m

#(c) Pr{5<Y<15}, find the area between 5 and 15 b/m via subtraction

(pnorm(15,7.3,11.1)-pnorm(5,7.3,11.1))*100
## [1] 33.81388

##interpretation - there is a 33.8% chance that the subject’s change in heart rate after caffeine use is between 5 and 15 b/m

Exercise 4.S.17 Refer to the heart rate distribution of Exercise 4.S.16. The fact that the standard deviation is greater than the average and that the distribution is normal tells us that some of the data values are negative, meaning that the person’s heart rate went down, rather than up. Find the probability that a randomly chosen person’s heart rate will go down. That is, find Pr{Y<0}.

#find the area to left of 0

(pnorm(0,7.3,11.1))*100
## [1] 25.53791

##interpretation - there is a 25.5% chance that the subject’s heart rate goes down after caffeine use (change < 0 b/m)
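#sketch: the pnorm answer can be sanity-checked by simulation, drawing many heart rate changes from the stated distribution and counting how often the change is negative. The sample size of 100,000 and the seed are arbitrary choices for this sketch:

```r
# Compare the exact normal probability with a simulated proportion
set.seed(42)                                   # arbitrary seed, for reproducibility
sim_change <- rnorm(1e5, mean = 7.3, sd = 11.1)

exact     <- pnorm(0, 7.3, 11.1)               # Pr{Y < 0} from the normal curve
simulated <- mean(sim_change < 0)              # share of simulated changes below 0

c(exact = exact, simulated = simulated)
```

#note: the two numbers should agree to about two decimal places; the simulated value changes slightly with the seed.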

Exercise 4.S.19 Refer to the heart rate distribution of Exercise 4.S.16. If we use the 1.5 * IQR rule, from Chapter 2, to identify outliers, how large would an observation need to be in order to be labeled an outlier on the upper end? The upper fence is calculated as Q3+(1.5*IQR). Any observation greater than the upper fence is considered an outlier.

#first create a simulated normal sample called simcaffchange with n = 1000, mean 7.3 and sd 11.1 using rnorm, and calculate the 5 number summary. Then calculate the upper fence.

simcaffchange <-rnorm(n=1000, mean=7.3, sd=11.1)
summary(simcaffchange)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -31.8336  -0.9292   6.8984   6.9585  14.8335  42.9001

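#calculate upper fence Q3+(1.5*IQR). note: rnorm draws a new sample on every run (no seed is set), so the quartiles typed in below are from a different draw than the summary printed above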

upperfence <-(14.83071+(1.5*(14.83071-0.08285)))
print(upperfence)
## [1] 36.9525

#generate a boxplot of simcaffchange to look for outliers

boxplot(simcaffchange, horizontal = TRUE)
abline(v= summary(simcaffchange), col= "blue")

#so the boxplot shows 4 outliers at the top of the distribution. For this simulated sample (n=1000, mean 7.3, sd 11.1), an observation greater than 36.9525 would be considered an outlier.
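#sketch: the fence can also be found without simulation, assuming the population quartiles are taken straight from the normal distribution with qnorm. This avoids the run-to-run variation of rnorm. The names q1, q3 and upper_fence are chosen for this sketch only:

```r
# Upper fence from the theoretical quartiles of N(7.3, 11.1)
mu <- 7.3
sigma <- 11.1
q1 <- qnorm(0.25, mu, sigma)          # population first quartile
q3 <- qnorm(0.75, mu, sigma)          # population third quartile

upper_fence <- q3 + 1.5 * (q3 - q1)   # Q3 + 1.5 * IQR
upper_fence

# Percentage of the population that would fall above the fence
pnorm(upper_fence, mu, sigma, lower.tail = FALSE) * 100
```

#note: this gives a fence of about 37.2 b/m, close to the simulated value; a sample of 1000 will land near it but not exactly on it.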

Exercise 4.S.20 It is claimed that the heart rates of Exercise 4.S.16 follow a normal distribution. If this is true, which of the following Shapiro–Wilk’s test P-values for a random sample of 15 subjects are consistent with this claim?

  1. P-value = 0.0649
  2. P-value = 0.1545
  3. P-value = 0.2498
  4. P-value = 0.0005

#apply the Shapiro-Wilk test to a simulated normal sample with n = 15, mean = 7.3, sd = 11.1

shapiro.test(rnorm(n=15, mean=7.3, sd=11.1))
## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(n = 15, mean = 7.3, sd = 11.1)
## W = 0.96669, p-value = 0.8064

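#interpretation: for the n=15 sample, p = 0.8064. The result suggests there is no compelling evidence that the data are not normally distributed. Of the listed P-values, 0.0649, 0.1545 and 0.2498 are all above the usual 0.05 cutoff and so are consistent with the claim of normality; 0.0005 is strong evidence against it.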
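#sketch: a single simulated sample gives a single P-value, and it changes on every run. Repeating the simulation many times shows what range of P-values to expect when the data really are normal. The 2000 repetitions, the seed and the 0.05 cutoff are choices made for this sketch:

```r
# Distribution of Shapiro-Wilk P-values for truly normal samples of size 15
set.seed(123)  # arbitrary seed, for reproducibility
p_values <- replicate(
  2000,
  shapiro.test(rnorm(15, mean = 7.3, sd = 11.1))$p.value
)

summary(p_values)          # P-values spread over the whole 0 to 1 range
mean(p_values < 0.05)      # share of normal samples flagged at the 0.05 level
```

#note: under normality only a small share of samples, roughly 5%, give P < 0.05, so a value like 0.0005 would be very unusual while the larger listed values are unremarkable.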