#(a) are within 1.5 standard deviations of the mean? #ans: for a standard normal distribution the mean is 0 and the sd is 1, so use the pnorm fx to calculate the area under the curve (AUC) to the left of +1.5 sd and to the left of -1.5 sd, then take the difference
(pnorm(1.5,0,1)-pnorm(-1.5,0,1))*100
## [1] 86.63856
##interpretation: 86.6% of observations are within 1.5 sd of the mean (between -1.5 and +1.5 sd)
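#a possible shortcut, sketched here as an aside: because the normal curve is symmetric, pnorm(k) - pnorm(-k) equals 2*pnorm(k) - 1. The helper name within_sd is hypothetical and not part of the exercise.

```r
# Percent of a normal distribution that lies within k sd of the mean.
# Uses the standard normal (mean 0, sd 1), which are pnorm's defaults.
within_sd <- function(k) (2 * pnorm(k) - 1) * 100

within_sd(1.5)  # same quantity as (pnorm(1.5,0,1)-pnorm(-1.5,0,1))*100
```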
#(b) are more than 2.5 standard deviations above the mean? #Ans: here we need only the upper tail, so take the complement of the area to the left of +2.5 sd
(1-pnorm(2.5,0,1))*100
## [1] 0.6209665
##interpretation: 0.62% of observations are more than 2.5 sd above the mean
#(c) are more than 3.5 standard deviations away from (above or below) the mean? #Ans: here we need both tails; by symmetry that is twice the area to the left of -3.5 sd
(2*pnorm(-3.5,0,1))*100
## [1] 0.04652582
##interpretation: 0.047% of observations are more than 3.5 sd away from the mean (above or below)
#a) here we are given a proportion of 90%, so we use the qnorm fx to find the z score, which on the standardized scale is measured in units of sd.
qnorm(.9,0,1)
## [1] 1.281552
##interpretation: the 90th percentile of a normal distribution is 1.28 sd above the mean
#b) again we are given a proportion, this time 10%, so we use the qnorm fx to find the z score, which on the standardized scale is measured in units of sd
qnorm(.1,0,1)
## [1] -1.281552
##interpretation: the 10th percentile of a normal distribution is 1.28 sd below the mean (z = -1.28)
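#a small check, offered as a sketch: qnorm and pnorm are inverses of each other, and the 10th and 90th percentiles mirror each other around 0. The variable names z90 and z10 are only illustrative.

```r
z90 <- qnorm(0.9)  # z score with 90% of the area to its left
z10 <- qnorm(0.1)  # z score with 10% of the area to its left

all.equal(z10, -z90)        # symmetry of the standard normal
all.equal(pnorm(z90), 0.9)  # pnorm undoes qnorm
```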
#(a) 164 or more? calculate the area to the right of 164 using the complement: Pr{Y >= 164} = 1 - Pr{Y < 164}
(1-pnorm(164,155,27))*100
## [1] 36.94413
##interpretation: 36.9% of 12-14 yr olds have serum cholesterol >= 164 mg/dl
#(b) 137 or less? here calculate the area to the left of 137: Pr{Y <= 137}
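#an alternative way to get the same upper-tail area, shown as a sketch: pnorm has a lower.tail argument, and the same answer comes from standardizing first with z = (y - mean)/sd. The names upper_a, upper_b and upper_c are only illustrative.

```r
mu <- 155     # mean serum cholesterol (mg/dl), from the exercise
sigma <- 27   # sd (mg/dl), from the exercise

upper_a <- 1 - pnorm(164, mean = mu, sd = sigma)               # complement
upper_b <- pnorm(164, mean = mu, sd = sigma, lower.tail = FALSE) # upper tail directly
upper_c <- pnorm((164 - mu) / sigma, lower.tail = FALSE)        # standardize, then use z

all.equal(upper_a, upper_b)
all.equal(upper_b, upper_c)
upper_b * 100
```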
(pnorm(137,155,27))*100
## [1] 25.24925
##interpretation: 25.2% of 12-14 yr olds have serum cholesterol <= 137 mg/dl
#(c) 186 or less? here calculate the area to the left of 186: Pr{Y <= 186}
(pnorm(186,155,27))*100
## [1] 87.45463
##interpretation: 87.5% of 12-14 yr olds have serum cholesterol <= 186 mg/dl
#(d) 100 or more? here calculate the area to the right of 100 using the complement: Pr{Y >= 100} = 1 - Pr{Y < 100}
(1-pnorm(100,155,27))*100
## [1] 97.91768
##interpretation: 97.9% of 12-14 yr olds have serum cholesterol >= 100 mg/dl
#(e) between 159 and 186? here calculate Pr{159 < Y < 186} as the difference between the area to the left of 186 and the area to the left of 159
(pnorm(186,155,27)-pnorm(159,155,27))*100
## [1] 31.56592
##interpretation: 31.6% of 12-14 yr olds have serum cholesterol between 159 and 186 mg/dl
#(f) between 100 and 132? calculate Pr{100 < Y < 132} as a difference of two areas
(pnorm(132,155,27)-pnorm(100,155,27))*100
## [1] 17.6325
##interpretation: 17.6% of 12-14 yr olds have serum cholesterol between 100 and 132 mg/dl
#(g) between 132 and 159? calculate Pr{132 < Y < 159} as a difference of two areas
(pnorm(159,155,27)-pnorm(132,155,27))*100
## [1] 36.17389
##interpretation: 36.2% of 12-14 yr olds have serum cholesterol between 132 and 159 mg/dl
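#a consistency check, sketched as an aside: the three adjacent intervals in (f), (g) and (e) should add up to the single interval from 100 to 186. The helper pct_between and its argument names are hypothetical.

```r
# Percent of a normal(mu, sigma) distribution between lo and hi.
pct_between <- function(lo, hi, mu = 155, sigma = 27) {
  (pnorm(hi, mu, sigma) - pnorm(lo, mu, sigma)) * 100
}

pieces <- pct_between(100, 132) + pct_between(132, 159) + pct_between(159, 186)
whole  <- pct_between(100, 186)

all.equal(pieces, whole)  # adjacent areas add up to the whole interval
```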
#Find #(a) Pr{Y >= 159}
(1-pnorm(159,155,27))*100
## [1] 44.11129
##interpretation - there is a 44.1% chance that a randomly chosen subject will have a cholesterol level of 159 mg/dl or more
#(b) Pr{159 < Y < 186}
(pnorm(186,155,27)-pnorm(159,155,27))*100
## [1] 31.56592
##interpretation - there is a 31.6% chance that a randomly chosen subject will have a cholesterol level between 159 and 186 mg/dl
###4.3.11 For the serum cholesterol distribution of Exercise 4.3.9, find (a) the 80th percentile (b) the 20th percentile
#a) find the 80th percentile - here we use the qnorm fx
qnorm(.8,155,27)
## [1] 177.7238
##interpretation: the 80th percentile value is 177.72 mg/dl of cholesterol
#b) find the 20th percentile. use qnorm
qnorm(.2,155,27)
## [1] 132.2762
##interpretation: the 20th percentile value is 132.28 mg/dl of cholesterol
#(a) What percentage of times were greater than 200 minutes? use the pnorm fx and the complement: Pr{Y > 200} = 1 - Pr{Y <= 200}
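#a sketch of what qnorm is doing here: a percentile on the original scale is the mean plus the standard-normal z score times the sd. The names mu, sigma, p80 and p20 are only illustrative.

```r
mu <- 155     # mean (mg/dl), from Exercise 4.3.9
sigma <- 27   # sd (mg/dl), from Exercise 4.3.9

p80 <- mu + qnorm(0.8) * sigma  # 80th percentile via the z score
p20 <- mu + qnorm(0.2) * sigma  # 20th percentile via the z score

all.equal(p80, qnorm(0.8, mu, sigma))
all.equal(p20, qnorm(0.2, mu, sigma))
(p80 + p20) / 2  # by symmetry the two percentiles are centered on the mean
```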
(1-pnorm(200,245,40))*100
## [1] 86.97055
##interpretation: 87.0% of runner times were greater than 200 minutes
#(b) What is the 60th percentile of the times? use the qnorm fx
qnorm(.6,245,40)
## [1] 255.1339
##interpretation: the 60th percentile for this dataset is 255.13 minutes
#(c) Notice that the normal curve approximation is fairly good except around the 240-minute mark. How can we explain this anomalous behavior of the distribution?
#ANS: one possible explanation for the change in the distribution is the effect of age. N Lehto found, in a study of the 312,342 runners who ran the Stockholm marathon from 1979 to 2014, that age affected performance. The study fit the relationship to a 2nd-order polynomial, t = a + bx + cx^2.
#A histogram of the age distribution on page 352 of the study looks similar to the distribution given in 4.3.17 c). Lehto’s histogram (fig 4) of time vs # of finishers for all 40-year-olds shows a clear change at approximately 240 minutes, similar to 4.3.17.
#Lehto concluded "The current investigation indicates that there exists an age when the physiology gives peak endurance performance in the marathon. With a level of confidence at 95%, this age was found to be 34.3 ± 2.6 years using the whole sample of male Stockholm Marathon finishers." I will post the paper to Discord.
##Ans: QQ plot a) goes with hist I - Why: from Samuels world ed, pg 147, if the top of the plot bends up, then the y values at the upper end of the distribution are too large for the distribution to be bell-shaped; that is, the distribution is skewed to the right or has large outliers. Hist I is skewed to the right, so plot a) and hist I match most closely.
#plot b) goes with hist III - from Samuels world ed, pg 147, if the bottom of the plot bends down, then the y values at the lower end of the distribution are too small for the distribution to be bell-shaped; that is, the distribution is skewed to the left or has small outliers. Hist III is skewed to the left, so plot b) and hist III match most closely.
#plot c) goes with hist II - QQ plot c) is linear without much change in the tails. Hist II has minimal skew. Therefore plot c) and hist II match most closely.
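#a sketch for seeing the rule from Samuels in action, using simulated data rather than the data behind the exercise's plots: a right-skewed sample bends up at the top of the QQ plot, a left-skewed sample bends down at the bottom, and a normal sample stays close to the line. The seed, sample size and the list name samples are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
n <- 500     # arbitrary sample size

samples <- list(
  right_skew = rexp(n),   # long right tail
  left_skew  = -rexp(n),  # mirror image, long left tail
  bell       = rnorm(n)   # roughly bell-shaped
)

# One normal QQ plot per sample, side by side
op <- par(mfrow = c(1, 3))
for (nm in names(samples)) {
  qqnorm(samples[[nm]], main = nm)
  qqline(samples[[nm]])
}
par(op)
```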
#(a) Consider the fastest riders. Are their times better than, worse than, or roughly equal to the times one would expect the fastest riders to have if the data came from a truly normal distribution? #Answer - starting at approximately z = +1.5 there is an upward curve in the plot. This change indicates skew or the pull of outliers in the data. Therefore I believe the fastest riders' times are potentially faster than expected if the data came from a normal distribution.
#(b) Consider the slowest riders. Are their times better than, worse than, or roughly equal to the times one would expect the slowest riders to have if the data came from a truly normal distribution? #Answer - the plot for the slowest times looks linear, therefore the slowest riders' times are about equal to the times expected if the data came from a normal distribution.
#Find #(a) Pr{Y>10}: find the area to the right of 10 b/m
(1-pnorm(10,7.3,11.1))*100
## [1] 40.39085
#(b) Pr{Y>20}: find the area to the right of 20 b/m
(1-pnorm(20,7.3,11.1))*100
## [1] 12.62819
#(c) Pr{5<Y<15}: find the area between 5 and 15 b/m via subtraction
(pnorm(15,7.3,11.1)-pnorm(5,7.3,11.1))*100
## [1] 33.81388
#find the area to the left of 0
(pnorm(0,7.3,11.1))*100
## [1] 25.53791
#interpretation - there is a 25.5% chance that the subject’s change in heart rate after caffeine use is < 0 b/m
#first create a simulated normal sample called simcaffchange with n = 1000, mean 7.3 and sd 11.1 using rnorm, then calculate the summary (the five-number summary plus the mean), then calculate the upper fence. no seed is set, so the values change on every run.
simcaffchange <-rnorm(n=1000, mean=7.3, sd=11.1)
summary(simcaffchange)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -31.8336 -0.9292 6.8984 6.9585 14.8335 42.9001
#calculate the upper fence Q3+(1.5*IQR), using Q1 and Q3 from the summary above
upperfence <-(14.8335+(1.5*(14.8335-(-0.9292))))
print(upperfence)
## [1] 38.47755
#generate a boxplot of simcaffchange to look for outliers
boxplot(simcaffchange, horizontal = TRUE)
abline(v= summary(simcaffchange), col= "blue")
#so the boxplot shows 4 outliers at the top of the distribution. For this simulated distribution (n=1000, mean 7.3, sd 11.1), a value greater than 38.47755 would be considered an outlier.
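#a sketch of computing the fences directly from the data instead of typing the quartiles in by hand, so the numbers cannot drift from the sample. The seed and the names sim, q1, q3, iqr, lower_fence, upper_fence and z_fence are assumptions for illustration; with a different seed the counts will differ.

```r
set.seed(1)  # arbitrary seed, an assumption; the original run sets none
sim <- rnorm(n = 1000, mean = 7.3, sd = 11.1)

q1  <- unname(quantile(sim, 0.25))
q3  <- unname(quantile(sim, 0.75))
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

sum(sim > upper_fence)  # number of points above the upper fence
sum(sim < lower_fence)  # number of points below the lower fence

# For an exact normal distribution the fences sit about 2.7 sd from the mean,
# so roughly 0.7% of values are expected to fall outside them.
z_fence <- qnorm(0.75) + 1.5 * (qnorm(0.75) - qnorm(0.25))
2 * pnorm(-z_fence) * 100
```

#with n = 1000 that works out to about 7 flagged points in total, so a handful of boxplot outliers is expected even when the data really are normal.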
#apply the Shapiro-Wilk test to a simulated normal sample with n=15, mean = 7.3, sd = 11.1
shapiro.test(rnorm(n=15, mean=7.3, sd=11.1))
##
## Shapiro-Wilk normality test
##
## data: rnorm(n = 15, mean = 7.3, sd = 11.1)
## W = 0.96669, p-value = 0.8064
#interpretation: for n=15, W = 0.96669 and p = 0.8064. Because p is well above 0.05, there is no compelling evidence that the data are not normally distributed.
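#a sketch of how much the p-value moves from sample to sample: because each call to rnorm draws new data, repeating the test many times on truly normal samples gives a spread of p-values, and about 5% of them should fall below 0.05 by chance. The seed, the number of repetitions and the name p_values are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Run the Shapiro-Wilk test on 2000 independent normal samples of size 15
p_values <- replicate(
  2000,
  shapiro.test(rnorm(n = 15, mean = 7.3, sd = 11.1))$p.value
)

range(p_values)        # p-values vary widely across samples
mean(p_values < 0.05)  # share of samples flagged as non-normal by chance
```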