Solutions to questions from Chapter 3 of OpenIntro Statistics, Third Edition
Libraries used
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(utils)
library(DATA606)
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
Q 1: (3.2) Area under the curve, Part I. What percent of a standard normal distribution N(\(\mu\) = 0; \(\sigma\) = 1) is found in each region? Be sure to draw a graph.
A: Plot points used for x-axis range between -3 and 3 because \(\mu \pm 3*\sigma\) is -3 and 3.
(a) Using the normal probability table, P(Z > -1.13) = 1 - 0.1292 = 0.8708. The area of interest lies between -1.13 and 3. The plot below shows this region.
normalPlot(mean = 0, sd = 1, bounds = c(-1.13,3), tails = FALSE)
A: (b) P(Z < 0.18) = 0.5714. The area of interest lies between -3 and 0.18. The plot below shows this region.
normalPlot(mean = 0, sd = 1, bounds = c(-3,0.18), tails = FALSE)
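A: (c) The normal probability table ends at Z = 3.50, where the cumulative probability is already 0.9998, so P(Z > 8) is smaller than 1 - 0.9998 = 0.0002 and is effectively 0. The area of interest lies beyond 8, far out in the right tail. The standard normal distribution is symmetric, so this does not indicate skew; it only means that an observation 8 standard deviations above the mean would be an extreme outlier.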
# The normalPlot function only draws plot points between -4 and 4 standard deviations, so this region cannot be displayed.
# x <- seq(-4, 4, length = 100) * sd + mean
# normalPlot(mean = 0, sd = 1, bounds = c(8,10), tails = TRUE)
A: (d) |Z| < 0.5 expands to -0.5 < Z < 0.5, so the area of interest is the central region between -0.5 and 0.5, not the two tails. P(-0.5 < Z < 0.5) = P(Z < 0.5) - P(Z < -0.5) = 0.6915 - 0.3085 = 0.3830. The plot below shows this region.
normalPlot(mean = 0, sd = 1, bounds = c(-0.5,0.5), tails = FALSE)
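As a cross-check on the table lookups for parts (a) through (d), the same areas can be computed directly. This is a minimal sketch using base R's pnorm(), which is not part of the original solution; the variable names are illustrative only.

```r
# Areas under the standard normal curve, computed with base R instead of the table.
area_a <- pnorm(-1.13, lower.tail = FALSE)   # P(Z > -1.13)
area_b <- pnorm(0.18)                        # P(Z < 0.18)
area_c <- pnorm(8, lower.tail = FALSE)       # P(Z > 8), far beyond the table's range
area_d <- pnorm(0.5) - pnorm(-0.5)           # P(|Z| < 0.5), the central region

round(c(a = area_a, b = area_b, c = area_c, d = area_d), 4)
```

Part (c) rounds to 0 at four decimal places, and part (d) is a central area rather than a sum of two tails.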
Q 2: (3.4) Triathlon times, Part I. In triathlons, it is common for racers to be placed into age and gender groups. Friends Leo and Mary both completed the Hermosa Beach Triathlon, where Leo competed in the Men, Ages 30 - 34 group while Mary competed in the Women, Ages 25 - 29 group. Leo completed the race in 1:22:28 (4948 seconds), while Mary completed the race in 1:31:53 (5513 seconds). Obviously Leo finished faster, but they are curious about how they did within their respective groups. Can you help them? Here is some information on the performance of their groups:
Remember: a better performance corresponds to a faster finish.
(a) Write down the short-hand for these two normal distributions.
(b) What are the Z-scores for Leo’s and Mary’s finishing times? What do these Z-scores tell you?
(c) Did Leo or Mary rank better in their respective groups? Explain your reasoning.
(d) What percent of the triathletes did Leo finish faster than in his group?
(e) What percent of the triathletes did Mary finish faster than in her group?
(f) If the distributions of finishing times are not nearly normal, would your answers to parts (b) - (e) change? Explain your reasoning.
A: (a) Short hand notation for Leo’s group: N(\(\mu\) = 4313, \(\sigma\) = 583), Mary’s group: N(\(\mu\) = 5261, \(\sigma\) = 807)
(b) Z-score = \(\frac{(x - \mu)}{\sigma}\). Leo’s time is 4948, so his Z-score is \(Z_{L}\)(x = 4948) = \(\frac{(4948 - 4313)}{583}\) = 1.0891938. Mary’s time is 5513, so her Z-score is \(Z_{M}\)(x = 5513) = \(\frac{(5513 - 5261)}{807}\) = 0.3122677. The Z-score of an observation tells how many standard deviations it falls above or below the mean. Leo’s completion time is 1.0891938 standard deviations (\(\sigma\) = 583) above the mean of his group. Mary’s completion time is 0.3122677 standard deviations (\(\sigma\) = 807) above the mean of her group.
(c) In a race a shorter time is better, so a lower Z-score indicates a better performance, and a negative Z-score means finishing faster than the group mean. Both Z-scores are positive, so both finished slower than their group averages, but Mary’s Z-score (0.31) is lower than Leo’s (1.09). Mary therefore ranked better in her group than Leo did in his.
(d) Based on the normal probability table, Leo’s completion time falls at the 86th percentile (rounded to 2 digits). He finished faster than 1 - 0.86 = 0.14, or 14%, of the triathletes in his group.
(e) Based on the normal probability table, Mary’s completion time falls at the 62nd percentile (rounded to 2 digits). She finished faster than 1 - 0.62 = 0.38, or 38%, of the triathletes in her group.
(f) Z-scores can be calculated even if the distribution is not normal, so my answers to (b) and (c) do not change, although Z-scores alone would then be a rougher guide to rank. The answers to (d) and (e), however, depend on the normal probability table; if the distributions are not nearly normal, those percentages would no longer be valid.
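The Z-scores and percentages in parts (b) through (e) can be reproduced in a few lines. This is a sketch in base R, not the original working; the list layout and the z_score() helper are hypothetical names introduced here for illustration.

```r
# Group parameters and finishing times from the question (seconds).
leo  <- list(time = 4948, mean = 4313, sd = 583)
mary <- list(time = 5513, mean = 5261, sd = 807)

# Z-score: distance from the group mean in units of the group standard deviation.
z_score <- function(x) (x$time - x$mean) / x$sd

z_leo  <- z_score(leo)
z_mary <- z_score(mary)

# Share of each group with a longer (slower) time, under the normal model.
slower_than_leo  <- pnorm(z_leo,  lower.tail = FALSE)
slower_than_mary <- pnorm(z_mary, lower.tail = FALSE)

round(c(z_leo = z_leo, z_mary = z_mary,
        slower_than_leo = slower_than_leo,
        slower_than_mary = slower_than_mary), 4)
```

Because pnorm() uses the unrounded Z-scores, the results can differ slightly in the third decimal place from values read off the table at Z = 1.09 and Z = 0.31.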
Q 3: (3.18) Heights of female college students. Below are heights of 25 female college students.
# female data frame
female<-seq(1:25)
height <- c(54,55,56,56,57,58,58,59,60,60,60,61,61,62,62,63,63,63,64,65,65,67,67,69,73)
female.height <- data.frame(female,height)
heightMean <- 61.52
heightSD <- 4.58
hist(female.height$height, probability = TRUE, ylim = c(0, 0.10))
x <- 50:80
y <- dnorm(x = x, mean = heightMean, sd = heightSD)
lines(x = x, y = y, col = "blue")
qqnorm(female.height$height)
qqline(female.height$height)
# Set ranges
height.Rule68Min <- heightMean - heightSD
height.Rule68Max <- heightMean + heightSD
height.Rule95Min <- heightMean - (2*heightSD)
height.Rule95Max <- heightMean + (2*heightSD)
height.Rule99Min <- heightMean - (3*heightSD)
height.Rule99Max <- heightMean + (3*heightSD)
# Apply the ranges to the table
female.height$percentileRule <- ifelse(female.height$height >= height.Rule68Min & female.height$height <= height.Rule68Max, 68, -1)
female.height$percentileRule <- ifelse(female.height$percentileRule == -1 & female.height$height >= height.Rule95Min & female.height$height <= height.Rule95Max, 95, female.height$percentileRule)
female.height$percentileRule <- ifelse(female.height$percentileRule == -1 & female.height$height >= height.Rule99Min & female.height$height <= height.Rule99Max, 99.7, female.height$percentileRule)
nrow(subset(female.height, female.height$percentileRule == 68)) / nrow(female.height) * 100
## [1] 68
nrow(subset(female.height, female.height$percentileRule <= 95)) / nrow(female.height) * 100
## [1] 96
nrow(subset(female.height, female.height$percentileRule <= 99.7)) / nrow(female.height) * 100
## [1] 100
head(female.height,25)
## female height percentileRule
## 1 1 54 95.0
## 2 2 55 95.0
## 3 3 56 95.0
## 4 4 56 95.0
## 5 5 57 68.0
## 6 6 58 68.0
## 7 7 58 68.0
## 8 8 59 68.0
## 9 9 60 68.0
## 10 10 60 68.0
## 11 11 60 68.0
## 12 12 61 68.0
## 13 13 61 68.0
## 14 14 62 68.0
## 15 15 62 68.0
## 16 16 63 68.0
## 17 17 63 68.0
## 18 18 63 68.0
## 19 19 64 68.0
## 20 20 65 68.0
## 21 21 65 68.0
## 22 22 67 95.0
## 23 23 67 95.0
## 24 24 69 95.0
## 25 25 73 99.7
A: (a) Percentage of observations that fall within one standard deviation of mean: 68
Percentage of observations that fall within two standard deviations of mean: 96
Percentage of observations that fall within three standard deviations of mean: 100
These percentages show that the heights approximately follow the 68-95-99.7% rule.
(b) Based on visual inspection of the histogram, the distribution of heights is unimodal and roughly symmetric. The normal curve drawn over the histogram fits the distribution approximately. The points on the normal probability plot also follow the line closely, apart from a few points at the tail ends. Overall, the heights appear to be nearly normally distributed.
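The three coverage percentages can also be obtained without labelling each row. The following is a compact sketch of the same check in base R; pct_within() is a hypothetical helper, and the mean and standard deviation are the values used in the code above.

```r
height <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61, 62, 62,
            63, 63, 63, 64, 65, 65, 67, 67, 69, 73)
height_mean <- 61.52   # same value as heightMean above
height_sd   <- 4.58    # same value as heightSD above

# Percentage of observations within k standard deviations of the mean.
pct_within <- function(k) {
  100 * mean(abs(height - height_mean) <= k * height_sd)
}

coverage <- sapply(1:3, pct_within)
names(coverage) <- c("1 SD", "2 SD", "3 SD")
coverage
```

Taking the mean of a logical vector gives the proportion of TRUE values, which avoids the intermediate percentileRule column.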
Q 4: (3.22) Defective rate. A machine that produces a special type of transistor (a component of computers) has a 2% defective rate. The production is considered a random process where each transistor is independent of the others.
A: (a) Under the geometric distribution, if the probability of a success in one trial is p and the probability of a failure is (1 - p), then the probability that the first success occurs on the nth trial is \((1 - p)^{n-1} p\).
Probability of finding a defective transistor (p) = 0.02 (2%), probability of finding a non-defective transistor (1 - p) = 1 - 0.02 = 0.98. Total trials (n) = 10. Probability that the tenth transistor selected is the first defective one = \((1 - p)^{n-1} p\) = \((0.98)^{10-1} \times 0.02\) = 0.016675
(b) The machine produces no defective transistors in a batch of 100 only if all 100 independent transistors are non-defective, so the probability is \((1 - p)^{100}\) = \((0.98)^{100}\) = 0.1326
(c) According to geometric distribution, mean \(\mu = \frac{1}{p}\) and standard deviation \(\sigma = \sqrt{\frac{1 - p}{{p}^{2}}}\) where probability of finding defective transistor (p) = 0.02.
Mean \(\mu\) = 1/0.02 = 50, so the expected number of transistors produced until the first defective one is 50. Standard deviation \(\sigma = \sqrt{\frac{1 - 0.02}{{0.02}^{2}}}\) = 49.4974747
(d) Probability of the second machine producing a defective transistor (p) = 0.05. Mean \(\mu\) = 1/0.05 = 20, so the expected number of transistors produced until the first defective one is 20. Standard deviation \(\sigma = \sqrt{\frac{1 - 0.05}{{0.05}^{2}}}\) = 19.4935887
(e) As the probability of success (here, producing a defective transistor) increases, both the mean and the standard deviation of the wait time decrease: the machine produces defective transistors more frequently, so the wait for the first one shortens. The mean 1/p is inversely proportional to p, and the standard deviation \(\sqrt{1 - p}/p\) also shrinks as p grows.
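The geometric-distribution quantities in parts (a) through (d) can be checked numerically. This is a sketch in base R rather than the original working; geom_summary() is a hypothetical helper name, and note that R's dgeom() counts failures before the first success rather than the trial number.

```r
p1 <- 0.02   # defect rate of the first machine
p2 <- 0.05   # defect rate of the second machine

# (a) First defect on the 10th transistor: 9 non-defective, then 1 defective.
# dgeom() takes the number of failures before the first success, hence 9.
p_tenth <- dgeom(9, prob = p1)

# (b) No defects in a batch of 100 independent transistors.
p_none <- (1 - p1)^100

# (c), (d) Mean and standard deviation of the wait for the first defect.
geom_summary <- function(p) c(mean = 1 / p, sd = sqrt((1 - p) / p^2))

round(c(p_tenth = p_tenth, p_none = p_none), 4)
rbind(machine_1 = geom_summary(p1), machine_2 = geom_summary(p2))
```

Putting the two machines side by side makes the point of part (e) visible: raising p from 0.02 to 0.05 lowers both the mean and the standard deviation.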
Q 5: (3.38) Male children. While it is often assumed that the probabilities of having a boy or a girl are the same, the actual probability of having a boy is slightly higher at 0.51. Suppose a couple plans to have 3 kids.
A: (a) Having a boy is considered a success with probability p = 0.51, and having a girl is a failure with probability (1 - p) = (1 - 0.51) = 0.49. The binomial model can be applied because the trials are independent, the number of trials is fixed, each outcome is either a success or a failure, and the probability of success is the same for every trial.
Here the number of independent trials (n) is 3 and k = 2 boys are required. By the binomial distribution, P(k = 2) = \(\binom{n}{k} p^{k} (1-p)^{n-k}\) = \(\binom{3}{2} \times 0.51^{2} \times (1-0.51)^{3-2}\) = 0.382347
(b) The possible orderings of 2 boys and a girl are {B,B,G}, {B,G,B} and {G,B,B}. The probability is P({B,B,G}) + P({B,G,B}) + P({G,B,B}) = (0.51)(0.51)(0.49) + (0.51)(0.49)(0.51) + (0.49)(0.51)(0.51) = 0.382347. This confirms that the answers to (a) and (b) match.
(c) For a family with n = 8 children and k = 3 boys, the number of orderings is \(\binom{n}{k} = \frac{n!}{k!\left(n - k \right)!} = \frac{8!}{3!\left(8 - 3\right)!}\) = 56. Listing all 56 orderings of 8 children and adding their probabilities, as in part (b), would be tedious and error-prone, so the binomial formula from part (a) is preferable.
# factorial(8)/(factorial(3)*factorial(5))
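Both routes to the answer, the binomial formula and the explicit enumeration, can be compared in code. This is a minimal sketch assuming base R only; the variable names are illustrative and the enumeration via expand.grid() is one possible way to list the orderings, not the original method.

```r
p_boy <- 0.51

# (a) Binomial formula: exactly 2 boys among 3 children.
prob_formula <- dbinom(2, size = 3, prob = p_boy)

# (b) Enumerate all 2^3 birth orders and keep those with exactly 2 boys.
orderings <- expand.grid(rep(list(c("B", "G")), 3), stringsAsFactors = FALSE)
two_boys  <- orderings[rowSums(orderings == "B") == 2, ]

# Each kept ordering has probability p^2 * (1 - p), so the total is a simple product.
prob_enumerated <- nrow(two_boys) * p_boy^2 * (1 - p_boy)

# (c) Number of orderings of 3 boys among 8 children.
n_orderings_8_3 <- choose(8, 3)

c(formula = prob_formula, enumerated = prob_enumerated, orderings_8_3 = n_orderings_8_3)
```

choose(8, 3) returns the same count as factorial(8) / (factorial(3) * factorial(5)) without computing the large factorials.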
Q 6: (3.42) Serving in volleyball. A not-so-skilled volleyball player has a 15% chance of making the serve, which involves hitting the ball so it passes over the net on a trajectory such that it will land in the opposing team’s court. Suppose that her serves are independent of each other.
A: (a) Using the negative binomial distribution. Probability of a successful serve (p) = 0.15 (15%). Failure = (1 - p) = (1 - 0.15) = 0.85. In this case, the 10th serve has to be the 3rd successful serve. P(3rd (k) successful serve on the 10th trial (n)) = \(\binom{n-1}{k-1} p^{k} (1-p)^{n-k}\) = \(\binom{10-1}{3-1} \times 0.15^{3} \times (1-0.15)^{10-3}\) = 0.0389501
(b) Since all serves are independent, the probability of success on the 10th serve is the same as the probability of success on any serve, which is 15%. The probability of success on the 10th serve is 0.15
(c) In part (a) the outcomes of all ten serves are still unknown, and the 10th must be the 3rd success, so the negative binomial model gives the probability of the whole sequence. In part (b) the first nine serves are already known, and we only need the probability that a single serve is successful. Hence the difference in answers.
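The negative binomial calculation in part (a) can be reproduced with R's built-in density. This is a sketch, not the original working; note that dnbinom() is parameterised by the number of failures before the target number of successes, not by the total number of trials.

```r
p_serve <- 0.15

# (a) Third success on the tenth serve means 7 misses before the 3rd success.
# dnbinom(x, size, prob): x = number of failures, size = number of successes.
p_third_on_tenth <- dnbinom(7, size = 3, prob = p_serve)

# Same value from the formula choose(n - 1, k - 1) * p^k * (1 - p)^(n - k).
p_manual <- choose(10 - 1, 3 - 1) * p_serve^3 * (1 - p_serve)^(10 - 3)

# (b) With independent serves, a single serve succeeds with probability p.
p_tenth_given_history <- p_serve

c(dnbinom = p_third_on_tenth, manual = p_manual, single_serve = p_tenth_given_history)
```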
Q 1: (3.1) Area under the curve, Part I. What percent of a standard normal distribution N(\(\mu\) = 0; \(\sigma\) = 1) is found in each region? Be sure to draw a graph.
A: Plot points used for x-axis range between -3 and 3 because \(\mu \pm 3*\sigma\) is -3 and 3.
(a) Using the normal probability table, P(Z < -1.35) = 0.0885. The plot below shows this region.
normalPlot(mean = 0, sd = 1, bounds = c(-3,-1.35), tails = FALSE)
A: (b) P(Z > 1.48) = 1 - 0.9306 = 0.0694. The area of interest lies between 1.48 and 3. The plot below shows this region.
normalPlot(mean = 0, sd = 1, bounds = c(1.48,3), tails = FALSE)
A: (c) The area of interest lies between -0.4 and 1.5. P(-0.4 < Z < 1.5) = P(Z < 1.5) - P(Z < -0.4) = 0.9332 - 0.3446 = 0.5886. The plot below shows this region.
normalPlot(mean = 0, sd = 1, bounds = c(-0.4,1.5), tails = FALSE)
A: (d) |Z| > 2 expands to Z < -2 or Z > 2, so the area of interest is the two tails below -2 and above 2. P(|Z| > 2) = P(Z < -2) + (1 - P(Z < 2)) = 0.0228 + (1 - 0.9772) = 0.0456. The plot below shows this region.
normalPlot(mean = 0, sd = 1, bounds = c(-2,2), tails = TRUE)
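The four table lookups above can be cross-checked with base R. This is a minimal sketch that is not part of the original solution; the variable names are illustrative only.

```r
# Areas under the standard normal curve for parts (a) to (d).
area_a <- pnorm(-1.35)                      # P(Z < -1.35)
area_b <- pnorm(1.48, lower.tail = FALSE)   # P(Z > 1.48)
area_c <- pnorm(1.5) - pnorm(-0.4)          # P(-0.4 < Z < 1.5)
area_d <- 2 * pnorm(-2)                     # P(|Z| > 2), using symmetry of the two tails

round(c(a = area_a, b = area_b, c = area_c, d = area_d), 4)
```

Part (d) can differ from the table-based answer in the fourth decimal place, because the table values are themselves rounded before being added.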
Q 2: (3.3) GRE scores, Part I. Sophia who took the Graduate Record Examination (GRE) scored 160 on the Verbal Reasoning section and 157 on the Quantitative Reasoning section. The mean score for Verbal Reasoning section for all test takers was 151 with a standard deviation of 7, and the mean score for the Quantitative Reasoning was 153 with a standard deviation of 7.67. Suppose that both distributions are nearly normal.
A: (a) Verbal Reasoning short hand: N(\(\mu\) = 151, \(\sigma\) = 7), Quantitative Reasoning section: N(\(\mu\) = 153, \(\sigma\) = 7.67)
(b) Z-score = \(\frac{(x - \mu)}{\sigma}\). Sophia’s Verbal Reasoning score is 160, so \(Z_{VR}\)(x = 160) = \(\frac{(160 - 151)}{7}\) = 1.285714. Her Quantitative Reasoning score is 157, so \(Z_{QR}\)(x = 157) = \(\frac{(157 - 153)}{7.67}\) = 0.5215124.
# Sophia's Verbal Reasoning section percentile (rounded to 2 digit) Z = 1.29.
x = seq(-2,2,by=0.1)
sdNormaldis <- data.frame(x = x, y = dnorm(x))
ggplot(sdNormaldis,aes(x=x,y=y)) + geom_line() + geom_vline(xintercept=1.29)
# Sophia's Quantitative Reasoning section percentile (rounded to 2 digit) Z = 0.52.
x = seq(-2,2,by=0.1)
sdNormaldis <- data.frame(x = x, y = dnorm(x))
ggplot(sdNormaldis,aes(x=x,y=y)) + geom_line() + geom_vline(xintercept=0.52)
(c) The Z-score of an observation tells how many standard deviations it falls above or below the mean. Sophia’s Verbal Reasoning score is 1.29 standard deviations (\(\sigma\) = 7) above the mean, and her Quantitative Reasoning score is 0.52 standard deviations (\(\sigma\) = 7.67) above the mean.
(d) Relative to other test takers, she did better on the Verbal Reasoning section: her score is 1.29 standard deviations above the mean of 151, which places her at about the 90th percentile, whereas her Quantitative Reasoning score is only 0.52 standard deviations above its mean.
(e) Sophia’s Verbal Reasoning percentile, using Z rounded to 1.29, is 0.9015. Her Quantitative Reasoning percentile, using Z rounded to 0.52, is 0.6985.
(f) The proportion of test takers who did better than Sophia on Verbal Reasoning is 1 - 0.9015 = 0.0985, or about 10%. On Quantitative Reasoning it is 1 - 0.6985 = 0.3015, or about 30%.
(g) The two sections are on different scales (their means and standard deviations differ), so comparing raw scores would be misleading. Comparing percentiles is more useful.
(h) Z-scores can be calculated even if the distribution is not normal, so my answer to (b) does not change. The answers to (c) through (f), however, depend on the normal probability table; if the distributions are not nearly normal, those answers would no longer be valid.
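Sophia's Z-scores and percentiles can be reproduced in base R. This is a sketch for checking the arithmetic, with illustrative variable names that do not appear in the original solution.

```r
# Section parameters from the question.
z_vr <- (160 - 151) / 7      # Verbal Reasoning Z-score
z_qr <- (157 - 153) / 7.67   # Quantitative Reasoning Z-score

# Percentiles under the normal model (share of test takers scoring lower).
pct_vr <- pnorm(z_vr)
pct_qr <- pnorm(z_qr)

# pnorm() also accepts the raw score with the mean and sd supplied directly.
pct_vr_direct <- pnorm(160, mean = 151, sd = 7)

round(c(z_vr = z_vr, z_qr = z_qr, pct_vr = pct_vr, pct_qr = pct_qr,
        better_vr = 1 - pct_vr, better_qr = 1 - pct_qr), 4)
```

The values differ slightly from the table-based 0.9015 and 0.6985 because the table lookup first rounds the Z-scores to two decimals.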
Q 3: (3.17) Scores on stats final. Below are final exam scores of 20 Introductory Statistics students.
# Scores data frame
student<-seq(1:20)
score <- c(57,66,69,71,72,73,74,77,78,78,79,79,81,81,82,83,83,88,89,94)
scores.data <- data.frame(student,score)
scoreMean <- 77.7
scoreSD <- 8.44
hist(scores.data$score, probability = TRUE, ylim = c(0, 0.06))
x <- 50:110
y <- dnorm(x = x, mean = scoreMean, sd = scoreSD)
lines(x = x, y = y, col = "blue")
qqnorm(scores.data$score)
qqline(scores.data$score)
# Set ranges
scores.Rule68Min <- scoreMean - scoreSD
scores.Rule68Max <- scoreMean + scoreSD
scores.Rule95Min <- scoreMean - (2*scoreSD)
scores.Rule95Max <- scoreMean + (2*scoreSD)
scores.Rule99Min <- scoreMean - (3*scoreSD)
scores.Rule99Max <- scoreMean + (3*scoreSD)
# Apply the ranges to the table
scores.data$percentileRule <- ifelse(scores.data$score >= scores.Rule68Min & scores.data$score <= scores.Rule68Max, 68, -1)
scores.data$percentileRule <- ifelse(scores.data$percentileRule == -1 & scores.data$score >= scores.Rule95Min & scores.data$score <= scores.Rule95Max, 95, scores.data$percentileRule)
scores.data$percentileRule <- ifelse(scores.data$percentileRule == -1 & scores.data$score >= scores.Rule99Min & scores.data$score <= scores.Rule99Max, 99.7, scores.data$percentileRule)
nrow(subset(scores.data, scores.data$percentileRule == 68)) / nrow(scores.data) * 100
## [1] 70
nrow(subset(scores.data, scores.data$percentileRule <= 95)) / nrow(scores.data) * 100
## [1] 95
nrow(subset(scores.data, scores.data$percentileRule <= 99.7)) / nrow(scores.data) * 100
## [1] 100
head(scores.data,20)
## student score percentileRule
## 1 1 57 99.7
## 2 2 66 95.0
## 3 3 69 95.0
## 4 4 71 68.0
## 5 5 72 68.0
## 6 6 73 68.0
## 7 7 74 68.0
## 8 8 77 68.0
## 9 9 78 68.0
## 10 10 78 68.0
## 11 11 79 68.0
## 12 12 79 68.0
## 13 13 81 68.0
## 14 14 81 68.0
## 15 15 82 68.0
## 16 16 83 68.0
## 17 17 83 68.0
## 18 18 88 95.0
## 19 19 89 95.0
## 20 20 94 95.0
A: (a) Percentage of observations that fall within one standard deviation of mean: 70
Percentage of observations that fall within two standard deviations of mean: 95
Percentage of observations that fall within three standard deviations of mean: 100
These percentages show that the scores approximately follow the 68-95-99.7% rule.
(b) Based on visual inspection of the histogram, the distribution of scores is unimodal and roughly symmetric. The normal curve drawn over the histogram fits the distribution approximately. The points on the normal probability plot also follow the line closely, apart from a few points at the tail ends. Overall, the scores appear to be nearly normally distributed.
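The hard-coded mean and standard deviation can be confirmed from the data, and the coverage percentages recomputed from them. This is a sketch in base R; pct_within() is a hypothetical helper, and it uses the sample statistics computed from the scores rather than the rounded constants.

```r
score <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
           79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Sample statistics computed from the data.
score_mean <- mean(score)
score_sd   <- sd(score)

# Percentage of scores within k standard deviations of the mean.
pct_within <- function(k) {
  100 * mean(abs(score - score_mean) <= k * score_sd)
}

coverage <- sapply(1:3, pct_within)
names(coverage) <- c("1 SD", "2 SD", "3 SD")

round(c(mean = score_mean, sd = score_sd), 2)
coverage
```

Computing the statistics from the vector avoids a mismatch if the data are later edited but the constants are not.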
Q 4: (3.21) Married women. The 2010 American Community Survey estimates that 47.1% of women ages 15 years and over are married.
A: (a) Under the geometric distribution, if the probability of a success in one trial is p and the probability of a failure is (1 - p), then the probability that the first success occurs on the nth trial is \((1 - p)^{n-1} p\).
Probability of selecting a married woman (p) = 0.471, probability of selecting a woman who is not married (1 - p) = 1 - 0.471 = 0.529. Total trials (n) = 3. Probability that the third woman selected is the only one who is married = \((1 - p)^{n-1} p\) = \((0.529)^{3-1} \times 0.471\) = 0.1318051
(b) Probability that all three randomly selected women are married = \(p^{3}\) = \(0.471^{3}\) = 0.1044871
(c) According to geometric distribution, mean \(\mu = \frac{1}{p}\) and standard deviation \(\sigma = \sqrt{\frac{1 - p}{{p}^{2}}}\) where probability of selecting married women (p) = 0.471.
Mean \(\mu\) = 1/0.471 = 2.1231423, so the expected number of women sampled until a married woman is selected is 2.1231423. Standard deviation \(\sigma = \sqrt{\frac{1 - 0.471}{{0.471}^{2}}}\) = 1.544212
(d) If the proportion is changed to 30%, the probability of selecting a married woman is p = 0.30. Mean \(\mu\) = 1/0.30 = 3.3333333, so the expected number of women sampled until a married woman is selected is 3.3333333. Standard deviation \(\sigma = \sqrt{\frac{1 - 0.30}{{0.30}^{2}}}\) = 2.7888668
(e) As the probability of success decreases, both the mean and the standard deviation of the wait time increase. The mean 1/p is inversely proportional to p, and the standard deviation \(\sqrt{1 - p}/p\) also grows as p shrinks.
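The values in parts (a) through (d) can be verified numerically. This is a sketch in base R, not the original working; geom_summary() is a hypothetical helper name, and dgeom() is given the number of failures before the first success.

```r
p_married <- 0.471

# (a) Two unmarried women, then a married one: 2 failures before the first success.
p_third_only <- dgeom(2, prob = p_married)

# (b) All three selected women are married.
p_all_three <- p_married^3

# (c), (d) Mean and standard deviation of the wait for the first married woman.
geom_summary <- function(p) c(mean = 1 / p, sd = sqrt((1 - p) / p^2))

round(c(p_third_only = p_third_only, p_all_three = p_all_three), 7)
rbind(p_0.471 = geom_summary(0.471), p_0.30 = geom_summary(0.30))
```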
Q 5: (3.37) Exploring combinations. The formula for the number of ways to arrange n objects is n! = n x (n - 1) x … x 2 x 1. This exercise walks you through the derivation of this formula for a couple of special cases. A small company has five employees: Anna, Ben, Carl, Damian, and Eddy. There are five parking spots in a row at the company, none of which are assigned, and each day the employees pull into a random parking spot. That is, all possible orderings of the cars in the row of spots are equally likely.
A: (a) The probability of Anna using the first parking spot is \(\frac{1}{5}\), Ben using the second spot is \(\frac{1}{5 - 1}\), Carl using the third spot is \(\frac{1}{5 - 2}\), Damian using the fourth spot is \(\frac{1}{5 - 3}\), and Eddy using the fifth spot is \(\frac{1}{5 - 4}\). On a given day, the probability that the employees park in alphabetical order is p = \(\left(\frac{1}{5}\right) * \left(\frac{1}{5 - 1}\right) * \left(\frac{1}{5 - 2}\right) * \left(\frac{1}{5 - 3}\right) * \left(\frac{1}{5 - 4}\right)\) = 0.0083333.
(b) 5 cars can be arranged in 5 factorial (5!) orderings: 5! = 5 * 4 * 3 * 2 * 1 = 120.
(c) 8 cars can be arranged in 8 factorial (8!) orderings: 8! = 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1 = 40320
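The link between the step-by-step product in part (a) and the factorial count in part (b) can be seen in a short calculation. This is a minimal sketch in base R with illustrative variable names.

```r
# (a) Multiply the chance that each employee takes the next free spot in order.
p_sequential <- prod(1 / (5:1))   # 1/5 * 1/4 * 1/3 * 1/2 * 1/1

# (b) Number of orderings of 5 cars; alphabetical order is exactly one of them.
n_orders_5     <- factorial(5)
p_alphabetical <- 1 / n_orders_5

# (c) Number of orderings of 8 cars.
n_orders_8 <- factorial(8)

c(p_sequential = p_sequential, p_alphabetical = p_alphabetical,
  n_orders_5 = n_orders_5, n_orders_8 = n_orders_8)
```

The two probabilities agree because all orderings are equally likely, so one specific ordering has probability 1 / 5!.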
Q 6: (3.41) Sampling at school. For a sociology class project you are asked to conduct a survey on 20 students at your school. You decide to stand outside of your dorm’s cafeteria and conduct the survey on a random sample of 20 students leaving the cafeteria after dinner one evening. Your dorm is comprised of 45% males and 55% females.
A: (a) The most suitable probability model is the negative binomial distribution. Its features are as follows:
(1) The trials are independent. (2) Each trial outcome can be classified as a success or failure. (3) The probability of a success (p) is the same for each trial. (4) The last trial must be a success.
In our case the 4th person surveyed needs to be the 2nd female.
(b) Selecting a female is considered a success with probability p = 0.55, and the probability of failure is 1 - p = 1 - 0.55 = 0.45. The number of students surveyed, up to and including the final successful trial, is n = 4. Among these 4 students there must be k = 2 females, and the last one must be the 2nd female. Using the negative binomial distribution, P(the kth success on the nth trial) = \(\binom{n-1}{k-1} p^{k} (1-p)^{n-k}\) = \(\binom{4-1}{2-1} \times 0.55^{2} \times (1-0.55)^{4-2}\) = 0.1837688
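# R's dnbinom(x, size, prob) takes x = number of failures and size = number of successes.
# Here there are 2 failures (males) before the 2nd success, so dnbinom(2, 2, 0.55) = 0.1837688, matching the manual calculation.
# dnbinom(2, 4, 0.55) = 0.1853002 is the probability of 2 failures before the 4th success, which is a different event and not a rounding difference.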
(c) As described in the problem, there are 3 ways to order 2 males and 1 female. Selecting a female is considered a success. With n = 3 students and k = 1 success, the binomial coefficient is \(\binom{n}{k}\) = \(\frac{n!}{k! \left(n - k\right)!}\) = \(\frac{3!}{1! \left(3 - 1\right)!}\) = 3
(d) The binomial coefficient places no restriction on the order of successes and failures; they can occur in any sequence. In the negative binomial distribution, the last trial is always reserved for a success: the kth success must fall on the nth trial. Only the first n - 1 trials, which contain the other k - 1 successes, can be reordered, which is why the negative binomial coefficient is \(\binom{n-1}{k-1}\) rather than \(\binom{n}{k}\).
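The negative binomial probability and the two coefficients can be checked in base R. This is a sketch with illustrative variable names; the key assumption to keep in mind is R's parameterisation, where dnbinom() takes the number of failures and the target number of successes rather than n and k.

```r
p_female <- 0.55
n <- 4   # the 4th person surveyed ...
k <- 2   # ... is the 2nd female

# Manual formula: choose(n - 1, k - 1) * p^k * (1 - p)^(n - k).
p_formula <- choose(n - 1, k - 1) * p_female^k * (1 - p_female)^(n - k)

# R's version: n - k failures before the k-th success.
p_dnbinom <- dnbinom(n - k, size = k, prob = p_female)

# Coefficients: free orderings of the first n - 1 trials vs. all n trials.
coef_negative_binomial <- choose(n - 1, k - 1)
coef_binomial          <- choose(n, k)

c(formula = p_formula, dnbinom = p_dnbinom,
  coef_negative_binomial = coef_negative_binomial, coef_binomial = coef_binomial)
```

The binomial coefficient is larger because it also counts orderings in which the 4th person is male, which the negative binomial setting rules out.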