Solutions to questions from Chapter 3 of OpenIntro Statistics, Third Edition

Libraries used

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(utils)
library(DATA606)
## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.
## 
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
## 
##     demo

Graded Questions

Q 1: (3.2) Area under the curve, Part I. What percent of a standard normal distribution N(\(\mu\) = 0; \(\sigma\) = 1) is found in each region? Be sure to draw a graph.

  (a) Z > -1.13 (b) Z < 0.18 (c) Z > 8 (d) |Z| < 0.5

A: The plot bounds below use -3 and 3 as stand-ins for the ends of the distribution, because \(\mu \pm 3\sigma\) is -3 and 3 and about 99.7% of the area lies in that range.

(a) Using the normal probability table, P(Z > -1.13) = 1 - 0.1292 = 0.8708, or about 87.08%. The area of interest lies to the right of -1.13 (plotted from -1.13 to 3). The following plot shows this graphically.

normalPlot(mean = 0, sd = 1, bounds = c(-1.13,3), tails = FALSE)

A: (b) P(Z < 0.18) = 0.5714, or about 57.14%. The area of interest lies to the left of 0.18 (plotted from -3 to 0.18). The following plot shows this graphically.

normalPlot(mean = 0, sd = 1, bounds = c(-3,0.18), tails = FALSE)

A: (c) The normal probability table ends at Z = 3.50, where the percentile is 0.9998, so P(Z > 8) is smaller than 1 - 0.9998 = 0.0002 and is effectively 0%. The area of interest lies to the right of 8, far out in the upper tail. An observation with Z = 8 would be an extreme outlier under this model; the standard normal distribution itself remains symmetric.

# normalPlot() only draws the x-axis from -4 to 4, so the region Z > 8 cannot be displayed.

# x <- seq(-4, 4, length = 100) * sd + mean

# normalPlot(mean = 0, sd = 1, bounds = c(8,10), tails = TRUE)

A: (d) |Z| < 0.5 expands to -0.5 < Z < 0.5. The area of interest is the central region between -0.5 and 0.5, not the two tails. P(|Z| < 0.5) = (percentile of 0.5) - (percentile of -0.5) = 0.6915 - 0.3085 = 0.3830, or about 38.3%. The following plot shows this graphically.

normalPlot(mean = 0, sd = 1, bounds = c(-0.5,0.5), tails = FALSE)
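
The table lookups for exercise 3.2 can be cross-checked numerically. The following is a minimal sketch using base R's `pnorm()`; the variable names (`p_a` through `p_d`) are illustrative and not part of the original solution.

```r
# Cross-check of the normal-table lookups for exercise 3.2.
# pnorm(q) returns P(Z <= q) for the standard normal distribution.
p_a <- 1 - pnorm(-1.13)               # P(Z > -1.13)
p_b <- pnorm(0.18)                    # P(Z < 0.18)
p_c <- pnorm(8, lower.tail = FALSE)   # P(Z > 8), far beyond the printed table
p_d <- pnorm(0.5) - pnorm(-0.5)       # P(|Z| < 0.5), the central area

# Collect the four areas, rounded to the four digits used by the table.
round(c(a = p_a, b = p_b, c = p_c, d = p_d), 4)
```

Using `lower.tail = FALSE` for part (c) avoids subtracting a number very close to 1, which matters for areas this small.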

Q 2: (3.4) Triathlon times, Part I. In triathlons, it is common for racers to be placed into age and gender groups. Friends Leo and Mary both completed the Hermosa Beach Triathlon, where Leo competed in the Men, Ages 30 - 34 group while Mary competed in the Women, Ages 25 - 29 group. Leo completed the race in 1:22:28 (4948 seconds), while Mary completed the race in 1:31:53 (5513 seconds). Obviously Leo finished faster, but they are curious about how they did within their respective groups. Can you help them? Here is some information on the performance of their groups:

  1. The finishing times of the Men, Ages 30 - 34 group has a mean of 4313 seconds with a standard deviation of 583 seconds.
  2. The finishing times of the Women, Ages 25 - 29 group has a mean of 5261 seconds with a standard deviation of 807 seconds.
  3. The distributions of finishing times for both groups are approximately Normal.

Remember: a better performance corresponds to a faster finish.

  (a) Write down the short-hand for these two normal distributions.

  (b) What are the Z-scores for Leo’s and Mary’s finishing times? What do these Z-scores tell you?

  (c) Did Leo or Mary rank better in their respective groups? Explain your reasoning.

  (d) What percent of the triathletes did Leo finish faster than in his group?

  (e) What percent of the triathletes did Mary finish faster than in her group?

  (f) If the distributions of finishing times are not nearly normal, would your answers to parts (b) - (e) change? Explain your reasoning.

A: (a) Short hand notation for Leo’s group: N(\(\mu\) = 4313, \(\sigma\) = 583), Mary’s group: N(\(\mu\) = 5261, \(\sigma\) = 807)

(b) Z-score = \(\frac{x - \mu}{\sigma}\). Leo’s time is 4948 seconds, so his Z-score is \(Z_{L} = \frac{4948 - 4313}{583}\) = 1.0891938. Mary’s time is 5513 seconds, so her Z-score is \(Z_{M} = \frac{5513 - 5261}{807}\) = 0.3122677. The Z-score of an observation tells how many standard deviations it falls above or below the mean. Leo’s completion time is about 1.09 standard deviations (\(\sigma\) = 583 seconds) above his group’s mean of 4313 seconds. Mary’s completion time is about 0.31 standard deviations (\(\sigma\) = 807 seconds) above her group’s mean of 5261 seconds.

(c) In a race a shorter time is better, so a negative Z-score (a time below the group mean) indicates a better-than-average performance, and among positive Z-scores the smaller one is better. Both Z-scores are positive, and Mary’s (0.31) is smaller than Leo’s (1.09), so Mary ranked better in her group than Leo did in his.

(d) Based on the normal probability table, Leo’s completion time falls at the 86th percentile (Z rounded to 1.09, percentile 0.8621). He finished faster than 1 - 0.86 = 0.14, or about 14%, of the triathletes in his group.

(e) Based on the normal probability table, Mary’s completion time falls at the 62nd percentile (Z rounded to 0.31, percentile 0.6217). She finished faster than 1 - 0.62 = 0.38, or about 38%, of the triathletes in her group.

(f) Z-scores can be calculated even if the distribution is not normal, so my answer to part (b) does not change. Parts (c) through (e) depend on the normal probability table to turn Z-scores into percentiles and rankings, so if the distributions are not nearly normal those answers could not be obtained this way and may change.
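
The Z-scores and tail areas for Leo and Mary can be reproduced directly. This is a small sketch under the normal models stated in part (a); the variable names are illustrative, and `pnorm()` uses the unrounded Z-scores, so the results differ slightly from the table values in the third decimal.

```r
# Z-scores for exercise 3.4, using the group means and SDs from the problem.
z_leo  <- (4948 - 4313) / 583
z_mary <- (5513 - 5261) / 807

# Share of each group with a slower (larger) finishing time than Leo / Mary,
# i.e. the upper-tail area beyond each Z-score.
slower_than_leo  <- pnorm(z_leo,  lower.tail = FALSE)
slower_than_mary <- pnorm(z_mary, lower.tail = FALSE)

round(c(z_leo = z_leo, z_mary = z_mary,
        slower_than_leo = slower_than_leo,
        slower_than_mary = slower_than_mary), 4)
```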

Q 3: (3.18) Heights of female college students. Below are heights of 25 female college students.

  (a) The mean height is 61.52 inches with a standard deviation of 4.58 inches. Use this information to determine if the heights approximately follow the 68-95-99.7% Rule.
  (b) Do these data appear to follow a normal distribution? Explain your reasoning using the graphs provided below.
# female data frame

female <- seq_len(25)
height <- c(54,55,56,56,57,58,58,59,60,60,60,61,61,62,62,63,63,63,64,65,65,67,67,69,73)
female.height <- data.frame(female, height)

heightMean <- 61.52
heightSD <- 4.58

hist(female.height$height, probability = TRUE, ylim = c(0, 0.10))
x <- 50:80
y <- dnorm(x = x, mean = heightMean, sd = heightSD)
lines(x = x, y = y, col = "blue")

qqnorm(female.height$height)
qqline(female.height$height)

# Set ranges
height.Rule68Min <- heightMean - heightSD
height.Rule68Max <- heightMean + heightSD

height.Rule95Min <- heightMean - (2*heightSD)
height.Rule95Max <- heightMean + (2*heightSD)

height.Rule99Min <- heightMean - (3*heightSD)
height.Rule99Max <- heightMean + (3*heightSD)

# Apply the ranges to the table
female.height$percentileRule <- ifelse(female.height$height >= height.Rule68Min & female.height$height <= height.Rule68Max, 68, -1)
female.height$percentileRule <- ifelse(female.height$percentileRule == -1 & female.height$height >= height.Rule95Min & female.height$height <= height.Rule95Max, 95, female.height$percentileRule)
female.height$percentileRule <- ifelse(female.height$percentileRule == -1 & female.height$height >= height.Rule99Min & female.height$height <= height.Rule99Max, 99.7, female.height$percentileRule)


nrow(subset(female.height, female.height$percentileRule == 68)) / nrow(female.height) * 100
## [1] 68
nrow(subset(female.height, female.height$percentileRule <= 95)) / nrow(female.height) * 100
## [1] 96
nrow(subset(female.height, female.height$percentileRule <= 99.7)) / nrow(female.height) * 100
## [1] 100
head(female.height,25)
##    female height percentileRule
## 1       1     54           95.0
## 2       2     55           95.0
## 3       3     56           95.0
## 4       4     56           95.0
## 5       5     57           68.0
## 6       6     58           68.0
## 7       7     58           68.0
## 8       8     59           68.0
## 9       9     60           68.0
## 10     10     60           68.0
## 11     11     60           68.0
## 12     12     61           68.0
## 13     13     61           68.0
## 14     14     62           68.0
## 15     15     62           68.0
## 16     16     63           68.0
## 17     17     63           68.0
## 18     18     63           68.0
## 19     19     64           68.0
## 20     20     65           68.0
## 21     21     65           68.0
## 22     22     67           95.0
## 23     23     67           95.0
## 24     24     69           95.0
## 25     25     73           99.7

A: (a) Percentage of observations that fall within one standard deviation of mean: 68

Percentage of observations that fall within two standard deviations of mean: 96

Percentage of observations that fall within three standard deviations of mean: 100

These numbers show that the heights approximately follow the 68-95-99.7% Rule.

(b) Based on visual inspection of the histogram, the distribution of heights is unimodal and roughly symmetric. The normal curve drawn over the histogram fits the distribution approximately. The points on the normal probability plot also stay close to the line, with some deviation at the tail ends. Overall, the heights appear to be approximately normally distributed.
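
The three percentages in part (a) can also be computed in a few lines from standardized distances. This is a compact alternative sketch, not the method used above; `within_k_sd` and the other names are illustrative.

```r
# Heights from exercise 3.18 with the mean and SD given in the problem.
height <- c(54,55,56,56,57,58,58,59,60,60,60,61,61,62,62,63,63,63,64,65,65,67,67,69,73)
height_mean <- 61.52
height_sd   <- 4.58

# Distance of each height from the mean, measured in standard deviations.
z_abs <- abs(height - height_mean) / height_sd

# Percentage of observations within 1, 2 and 3 standard deviations of the mean.
within_k_sd <- sapply(1:3, function(k) mean(z_abs <= k) * 100)
names(within_k_sd) <- c("1 SD", "2 SD", "3 SD")
within_k_sd
```

Because each percentage is computed independently, an observation more than three standard deviations from the mean would simply not be counted in any band.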

Q 4: (3.22) Defective rate. A machine that produces a special type of transistor (a component of computers) has a 2% defective rate. The production is considered a random process where each transistor is independent of the others.

  (a) What is the probability that the 10th transistor produced is the first with a defect?
  (b) What is the probability that the machine produces no defective transistors in a batch of 100?
  (c) On average, how many transistors would you expect to be produced before the first with a defect? What is the standard deviation?
  (d) Another machine that also produces transistors has a 5% defective rate where each transistor is produced independent of the others. On average how many transistors would you expect to be produced with this machine before the first with a defect? What is the standard deviation?
  (e) Based on your answers to parts (c) and (d), how does increasing the probability of an event affect the mean and standard deviation of the wait time until success?

A: (a) Under the geometric distribution, if the probability of a success in one trial is p and the probability of a failure is (1 - p), then the probability that the first success occurs on the nth trial is \((1 - p)^{n-1}p\).

Probability of finding a defective transistor (p) = 0.02 (2%), probability of finding a non-defective transistor (1 - p) = 1 - 0.02 = 0.98. Total trials (n) = 10. Probability that the 10th transistor produced is the first with a defect = \((1 - p)^{n-1} p = (0.98)^{10-1} \times 0.02\) = 0.016675

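(b) The machine produces no defective transistors in a batch of 100 only if all 100 transistors are non-defective. Because the transistors are independent, the probability is \((1 - p)^{100} = (0.98)^{100}\) ≈ 0.1326.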

(c) According to geometric distribution, mean \(\mu = \frac{1}{p}\) and standard deviation \(\sigma = \sqrt{\frac{1 - p}{{p}^{2}}}\) where probability of finding defective transistor (p) = 0.02.

Mean \(\mu\) = 1/0.02 = 50, so on average the machine is expected to produce 50 transistors to reach the first defective one. Standard deviation \(\sigma = \sqrt{\frac{1 - 0.02}{{0.02}^{2}}}\) = 49.4974747

(d) Probability of the second machine producing a defective transistor (p) = 0.05. Mean \(\mu\) = 1/0.05 = 20, so on average this machine is expected to produce 20 transistors to reach the first defective one. Standard deviation \(\sigma = \sqrt{\frac{1 - 0.05}{{0.05}^{2}}}\) = 19.4935887

(e) When the probability of success (here, producing a defective transistor) increases from 0.02 to 0.05, both the mean and the standard deviation of the wait time decrease. The machine produces defective transistors more frequently, so the wait until the first one is shorter and less variable. The mean is inversely proportional to p (\(\mu = \frac{1}{p}\)), and the standard deviation \(\sigma = \frac{\sqrt{1 - p}}{p}\) also decreases as p increases.
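
The geometric calculations for exercise 3.22 can be checked in R. The sketch below assumes base R's `dgeom()`, which counts the number of failures before the first success (so the 10th trial corresponds to 9 failures); the helper names `geom_mean` and `geom_sd` are illustrative.

```r
p_defect <- 0.02

# (a) First defect on the 10th transistor: 9 non-defective, then 1 defective.
first_defect_at_10 <- dgeom(9, prob = p_defect)

# (b) No defects in a batch of 100 independent transistors.
no_defect_in_100 <- (1 - p_defect)^100

# (c), (d) Mean and standard deviation of the wait time until the first defect.
geom_mean <- function(p) 1 / p
geom_sd   <- function(p) sqrt((1 - p) / p^2)

wait_summary <- data.frame(
  p    = c(0.02, 0.05),
  mean = geom_mean(c(0.02, 0.05)),
  sd   = geom_sd(c(0.02, 0.05))
)
wait_summary
```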

Q 5: (3.38) Male children. While it is often assumed that the probabilities of having a boy or a girl are the same, the actual probability of having a boy is slightly higher at 0.51. Suppose a couple plans to have 3 kids.

  (a) Use the binomial model to calculate the probability that two of them will be boys.
  (b) Write out all possible orderings of 3 children, 2 of whom are boys. Use these scenarios to calculate the same probability from part (a) but using the addition rule for disjoint outcomes. Confirm that your answers from parts (a) and (b) match.
  (c) If we wanted to calculate the probability that a couple who plans to have 8 kids will have 3 boys, briefly describe why the approach from part (b) would be more tedious than the approach from part (a).

A: (a) Having a boy is considered a success with probability (p) = 0.51, and having a girl is considered a failure with probability (1 - p) = (1 - 0.51) = 0.49. The binomial model can be applied to calculate the probability because

  1. The trials are independent.
  2. The number of trials, n, is fixed.
  3. Each trial outcome can be classified as a success or failure.
  4. The probability of a success, p, is the same for each trial.

In this case the number of independent trials (n) is 3 and the number of successes (k) is 2 boys. According to the binomial distribution, P(k = 2) = \(\binom{n}{k}p^{k}(1-p)^{n-k} = \binom{3}{2} \times 0.51^{2} \times (1-0.51)^{3-2}\) = 0.382347

(b) The possible orderings of 2 boys and a girl are {B,B,G}, {B,G,B} and {G,B,B}. These outcomes are disjoint, so the probability is P({B,B,G}) + P({B,G,B}) + P({G,B,B}) = (0.51)(0.51)(0.49) + (0.51)(0.49)(0.51) + (0.49)(0.51)(0.51) = 0.382347. This confirms that the answers to (a) and (b) match.

(c) For a family with 8 (n) children, 3 (k) of whom are boys, the number of orderings of 3 successes in 8 trials is \(\binom{n}{k} = \frac{n!}{k!\left(n - k \right)!} = \frac{8!}{3!\left(8 - 3\right)!}\) = 56. Writing out and summing 56 separate orderings of 8 children is far more tedious and error-prone than applying the binomial formula once.

# factorial(8)/(factorial(3)*factorial(5))
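
Both approaches to exercise 3.38 can be compared in code. This sketch uses base R's `dbinom()` for the binomial model and `expand.grid()` to enumerate the orderings; the object names are illustrative.

```r
p_boy <- 0.51

# (a) Binomial model: exactly 2 boys among 3 children.
binom_prob <- dbinom(2, size = 3, prob = p_boy)

# (b) Enumerate all 2^3 orderings and keep those with exactly 2 boys.
orderings <- expand.grid(rep(list(c("B", "G")), 3), stringsAsFactors = FALSE)
two_boys  <- orderings[rowSums(orderings == "B") == 2, ]

# Probability of each kept ordering, then the addition rule for disjoint outcomes.
enum_prob <- sum(apply(two_boys, 1, function(kids) {
  prod(ifelse(kids == "B", p_boy, 1 - p_boy))
}))

# (c) Number of orderings that would have to be listed for 3 boys among 8 children.
n_orderings_8_3 <- choose(8, 3)

c(binomial = binom_prob, enumerated = enum_prob, orderings_8_3 = n_orderings_8_3)
```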

Q 6: (3.42) Serving in volleyball. A not-so-skilled volleyball player has a 15% chance of making the serve, which involves hitting the ball so it passes over the net on a trajectory such that it will land in the opposing team’s court. Suppose that her serves are independent of each other.

  (a) What is the probability that on the 10th try she will make her 3rd successful serve?
  (b) Suppose she has made two successful serves in nine attempts. What is the probability that her 10th serve will be successful?
  (c) Even though parts (a) and (b) discuss the same scenario, the probabilities you calculated should be different. Can you explain the reason for this discrepancy?

A: (a) Using the negative binomial distribution. Probability of a successful serve (p) = 0.15 (15%). Probability of failure = (1 - p) = (1 - 0.15) = 0.85. In this case the 10th serve has to be the 3rd successful serve. P(3rd (k) successful serve on the 10th trial (n)) = \(\binom{n-1}{k-1}p^{k}(1-p)^{n-k} = \binom{10-1}{3-1} \times 0.15^{3} \times (1-0.15)^{10-3}\) = 0.0389501

(b) Since all serves are independent, the probability of success on the 10th serve is the same as the probability of success on any serve, regardless of the earlier results. In this case it is 15%, so the probability of success on the 10th serve is 0.15

(c) In part (a) we need the joint probability of a whole sequence: exactly 2 successes in the first 9 serves and then a success on the 10th, so we used the negative binomial model. In part (b) the first 9 serves are already known, and we are looking for the probability of a single event, the 10th serve, being successful. Hence the difference in answers.
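
The negative binomial result in part (a) can be verified two ways. This is a sketch assuming base R's `dnbinom(x, size, prob)`, where `x` is the number of failures before the `size`-th success (here 7 failures before the 3rd success); the variable names are illustrative.

```r
p_serve <- 0.15

# Manual formula: choose(n - 1, k - 1) * p^k * (1 - p)^(n - k) with n = 10, k = 3.
third_on_tenth_manual <- choose(10 - 1, 3 - 1) * p_serve^3 * (1 - p_serve)^(10 - 3)

# Built-in density: 7 failures before the 3rd success.
third_on_tenth_builtin <- dnbinom(10 - 3, size = 3, prob = p_serve)

# Part (b): a single independent serve succeeds with probability p.
tenth_serve_success <- p_serve

c(manual = third_on_tenth_manual, builtin = third_on_tenth_builtin,
  single_serve = tenth_serve_success)
```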

Practice Questions

Q 1: (3.1) Area under the curve, Part I. What percent of a standard normal distribution N(\(\mu\) = 0; \(\sigma\) = 1) is found in each region? Be sure to draw a graph.

  (a) Z < -1.35 (b) Z > 1.48 (c) -0.4 < Z < 1.5 (d) |Z| > 2

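A: The plot bounds below use -3 and 3 as stand-ins for the ends of the distribution, because \(\mu \pm 3\sigma\) is -3 and 3 and about 99.7% of the area lies in that range.

(a) Using the normal probability table, P(Z < -1.35) = 0.0885, or about 8.85%. The area of interest lies to the left of -1.35. The following plot shows this graphically.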

normalPlot(mean = 0, sd = 1, bounds = c(-3,-1.35), tails = FALSE)

A: (b) P(Z > 1.48) = 1 - 0.9306 = 0.0694, or about 6.94%. The area of interest lies to the right of 1.48 (plotted from 1.48 to 3). The following plot shows this graphically.

normalPlot(mean = 0, sd = 1, bounds = c(1.48,3), tails = FALSE)

A: (c) The area of interest is between -0.4 and 1.5. P(-0.4 < Z < 1.5) = (percentile of 1.5) - (percentile of -0.4) = 0.9332 - 0.3446 = 0.5886, or about 58.86%. The following plot shows this graphically.

normalPlot(mean = 0, sd = 1, bounds = c(-0.4,1.5), tails = FALSE)

A: (d) |Z| > 2 expands to Z < -2 or Z > 2. The area of interest is below -2 and above 2, the tail ends on both sides. P(|Z| > 2) = (percentile of -2) + (1 - percentile of 2) = 0.0228 + (1 - 0.9772) = 0.0456, or about 4.56%. The following plot shows this graphically.

normalPlot(mean = 0, sd = 1, bounds = c(-2,2), tails = TRUE)
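
As a numerical cross-check for exercise 3.1, the four areas can be computed with `pnorm()`. The helper `area_between()` is a hypothetical convenience function introduced here for illustration; results from `pnorm()` may differ from the four-digit table in the last digit.

```r
# Area under the standard normal curve between two Z values.
# Use -Inf or Inf for an open-ended region.
area_between <- function(lower, upper) pnorm(upper) - pnorm(lower)

q1_a <- area_between(-Inf, -1.35)                          # P(Z < -1.35)
q1_b <- area_between(1.48, Inf)                            # P(Z > 1.48)
q1_c <- area_between(-0.4, 1.5)                            # P(-0.4 < Z < 1.5)
q1_d <- area_between(-Inf, -2) + area_between(2, Inf)      # P(|Z| > 2)

round(c(a = q1_a, b = q1_b, c = q1_c, d = q1_d), 4)
```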

Q 2: (3.3) GRE scores, Part I. Sophia who took the Graduate Record Examination (GRE) scored 160 on the Verbal Reasoning section and 157 on the Quantitative Reasoning section. The mean score for Verbal Reasoning section for all test takers was 151 with a standard deviation of 7, and the mean score for the Quantitative Reasoning was 153 with a standard deviation of 7.67. Suppose that both distributions are nearly normal.

  (a) Write down the short-hand for these two normal distributions.
  (b) What is Sophia’s Z-score on the Verbal Reasoning section? On the Quantitative Reasoning section? Draw a standard normal distribution curve and mark these two Z-scores.
  (c) What do these Z-scores tell you?
  (d) Relative to others, which section did she do better on?
  (e) Find her percentile scores for the two exams.
  (f) What percent of the test takers did better than her on the Verbal Reasoning section? On the Quantitative Reasoning section?
  (g) Explain why simply comparing raw scores from the two sections could lead to an incorrect conclusion as to which section a student did better on.
  (h) If the distributions of the scores on these exams are not nearly normal, would your answers to parts (b) - (f) change? Explain your reasoning.

A: (a) Verbal Reasoning short hand: N(\(\mu\) = 151, \(\sigma\) = 7), Quantitative Reasoning section: N(\(\mu\) = 153, \(\sigma\) = 7.67)

(b) Z-score = \(\frac{x - \mu}{\sigma}\). Sophia’s Verbal Reasoning score is 160, so \(Z_{VR} = \frac{160 - 151}{7}\) = 1.285714. Her Quantitative Reasoning score is 157, so \(Z_{QR} = \frac{157 - 153}{7.67}\) = 0.5215124.

# Mark Sophia's Verbal Reasoning Z-score (rounded to 2 digits), Z = 1.29, on the standard normal curve.
x <- seq(-2, 2, by = 0.1)
sdNormaldis <- data.frame(x = x, y = dnorm(x))
ggplot(sdNormaldis, aes(x = x, y = y)) + geom_line() + geom_vline(xintercept = 1.29)

# Mark Sophia's Quantitative Reasoning Z-score (rounded to 2 digits), Z = 0.52, on the standard normal curve.
x <- seq(-2, 2, by = 0.1)
sdNormaldis <- data.frame(x = x, y = dnorm(x))
ggplot(sdNormaldis, aes(x = x, y = y)) + geom_line() + geom_vline(xintercept = 0.52)

(c) The Z-score of an observation tells how many standard deviations it falls above or below the mean. Sophia’s Verbal Reasoning score is 1.29 standard deviations (\(\sigma\) = 7) above the mean, and her Quantitative Reasoning score is 0.52 standard deviations (\(\sigma\) = 7.67) above the mean.

(d) Relative to others she did better on the Verbal Reasoning section: her Z-score there (1.29) is higher than her Quantitative Reasoning Z-score (0.52). Her Verbal Reasoning score falls 1.29 standard deviations above the mean of 151, which places her at about the 90th percentile.

(e) Sophia’s Verbal Reasoning percentile (Z rounded to 2 digits): P(Z < 1.29) = 0.9015. Quantitative Reasoning percentile (Z rounded to 2 digits): P(Z < 0.52) = 0.6985

(f) Proportion of test takers who did better than Sophia on the Verbal Reasoning section = 1 - (Sophia’s Verbal Reasoning percentile) = 1 - 0.9015 = 0.0985, so about 10% did better than Sophia. Proportion who did better than Sophia on the Quantitative Reasoning section = 1 - (Sophia’s Quantitative Reasoning percentile) = 1 - 0.6985 = 0.3015, so about 30% did better than Sophia.

(g) The two sections are on different scales (their means and standard deviations differ), so comparing raw scores could be misleading. Comparing Z-scores or percentiles puts both sections on a common scale.

(h) Z-scores can be calculated even if the distribution is not normal, so my answers to parts (b) and (c) do not change. Parts (d) through (f) depend on the normal probability table, so if the distributions are not nearly normal those answers may change.
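
Sophia's Z-scores and percentiles can be tabulated together. This is a minimal sketch under the normal models from part (a); the data frame `gre` and its column names are illustrative, and `pnorm()` works with unrounded Z-scores, so the percentiles differ slightly from the table lookups.

```r
# One row per GRE section, with the scores and parameters from the problem.
gre <- data.frame(
  section = c("Verbal", "Quantitative"),
  score   = c(160, 157),
  mean    = c(151, 153),
  sd      = c(7, 7.67)
)

gre$z          <- (gre$score - gre$mean) / gre$sd   # part (b)
gre$percentile <- pnorm(gre$z)                      # part (e)
gre$pct_better <- 1 - gre$percentile                # part (f)
gre
```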

Q 3: (3.17) Scores on stats final. Below are final exam scores of 20 Introductory Statistics students.

  (a) The mean score is 77.7 points, with a standard deviation of 8.44 points. Use this information to determine if the scores approximately follow the 68-95-99.7% Rule.
  (b) Do these data appear to follow a normal distribution? Explain your reasoning using the graphs provided below.
# Scores data frame

student <- seq_len(20)
score <- c(57,66,69,71,72,73,74,77,78,78,79,79,81,81,82,83,83,88,89,94)
scores.data <- data.frame(student, score)

scoreMean <- 77.7
scoreSD <- 8.44

hist(scores.data$score, probability = TRUE, ylim = c(0, 0.06))
x <- 50:110
y <- dnorm(x = x, mean = scoreMean, sd = scoreSD)
lines(x = x, y = y, col = "blue")

qqnorm(scores.data$score)
qqline(scores.data$score)

# Set ranges
scores.Rule68Min <- scoreMean - scoreSD
scores.Rule68Max <- scoreMean + scoreSD

scores.Rule95Min <- scoreMean - (2*scoreSD)
scores.Rule95Max <- scoreMean + (2*scoreSD)

scores.Rule99Min <- scoreMean - (3*scoreSD)
scores.Rule99Max <- scoreMean + (3*scoreSD)

# Apply the ranges to the table
scores.data$percentileRule <- ifelse(scores.data$score >= scores.Rule68Min & scores.data$score <= scores.Rule68Max, 68, -1)
scores.data$percentileRule <- ifelse(scores.data$percentileRule == -1 & scores.data$score >= scores.Rule95Min & scores.data$score <= scores.Rule95Max, 95, scores.data$percentileRule)
scores.data$percentileRule <- ifelse(scores.data$percentileRule == -1 & scores.data$score >= scores.Rule99Min & scores.data$score <= scores.Rule99Max, 99.7, scores.data$percentileRule)


nrow(subset(scores.data, scores.data$percentileRule == 68)) / nrow(scores.data) * 100
## [1] 70
nrow(subset(scores.data, scores.data$percentileRule <= 95)) / nrow(scores.data) * 100
## [1] 95
nrow(subset(scores.data, scores.data$percentileRule <= 99.7)) / nrow(scores.data) * 100
## [1] 100
head(scores.data,20)
##    student score percentileRule
## 1        1    57           99.7
## 2        2    66           95.0
## 3        3    69           95.0
## 4        4    71           68.0
## 5        5    72           68.0
## 6        6    73           68.0
## 7        7    74           68.0
## 8        8    77           68.0
## 9        9    78           68.0
## 10      10    78           68.0
## 11      11    79           68.0
## 12      12    79           68.0
## 13      13    81           68.0
## 14      14    81           68.0
## 15      15    82           68.0
## 16      16    83           68.0
## 17      17    83           68.0
## 18      18    88           95.0
## 19      19    89           95.0
## 20      20    94           95.0

A: (a) Percentage of observations that fall within one standard deviation of mean: 70

Percentage of observations that fall within two standard deviations of mean: 95

Percentage of observations that fall within three standard deviations of mean: 100

These numbers show that the scores approximately follow the 68-95-99.7% Rule.

(b) Based on visual inspection of the histogram, the distribution of scores is unimodal and roughly symmetric. The normal curve drawn over the histogram fits the distribution approximately. The points on the normal probability plot also stay close to the line, with some deviation at the tail ends. Overall, the scores appear to be approximately normally distributed.

Q 4: (3.21) Married women. The 2010 American Community Survey estimates that 47.1% of women ages 15 years and over are married.

  (a) We randomly select three women between these ages. What is the probability that the third woman selected is the only one who is married?
  (b) What is the probability that all three randomly selected women are married?
  (c) On average, how many women would you expect to sample before selecting a married woman? What is the standard deviation?
  (d) If the proportion of married women was actually 30%, how many women would you expect to sample before selecting a married woman? What is the standard deviation?
  (e) Based on your answers to parts (c) and (d), how does decreasing the probability of an event affect the mean and standard deviation of the wait time until success?

A: (a) Under the geometric distribution, if the probability of a success in one trial is p and the probability of a failure is (1 - p), then the probability that the first success occurs on the nth trial is \((1 - p)^{n-1}p\).

Probability of selecting a married woman (p) = 0.471, probability of selecting a woman who is not married (1 - p) = 1 - 0.471 = 0.529. Total trials (n) = 3. Probability that the third woman selected is the only one who is married = \((1 - p)^{n-1} p = (0.529)^{3-1} \times 0.471\) = 0.1318051

(b) Probability that all three randomly selected women are married = \(p^{3} = 0.471^{3}\) = 0.1044871

(c) Under the geometric distribution, mean \(\mu = \frac{1}{p}\) and standard deviation \(\sigma = \sqrt{\frac{1 - p}{{p}^{2}}}\), where the probability of selecting a married woman (p) = 0.471.

Mean \(\mu\) = 1/0.471 = 2.1231423, so on average we expect to sample about 2.12 women to reach the first married woman. Standard deviation \(\sigma = \sqrt{\frac{1 - 0.471}{{0.471}^{2}}}\) = 1.544212

(d) If the proportion is changed to 30%, then the probability of selecting a married woman (p) = 0.30. Mean \(\mu\) = 1/0.30 = 3.3333333, so on average we expect to sample about 3.33 women to reach the first married woman. Standard deviation \(\sigma = \sqrt{\frac{1 - 0.30}{{0.30}^{2}}}\) = 2.7888668

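(e) As the probability of success decreases, both the mean and the standard deviation of the wait time increase. The mean is inversely proportional to p (\(\mu = \frac{1}{p}\)), and the standard deviation \(\sigma = \frac{\sqrt{1 - p}}{p}\) also increases as p decreases.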
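
The values for exercise 3.21 can be recomputed for both proportions at once. This sketch relies on R's vectorized arithmetic and on `dgeom()`, which counts failures before the first success; the names `p_married`, `wait_mean` and `wait_sd` are illustrative.

```r
# Proportion married: survey estimate and the alternative from part (d).
p_married <- c(survey = 0.471, alternative = 0.30)

# (a) First two women unmarried, third married: 2 failures before the first success.
third_is_only_married <- dgeom(2, prob = p_married[["survey"]])

# (b) All three women married.
all_three_married <- p_married[["survey"]]^3

# (c), (d) Mean and standard deviation of the wait time for each proportion.
wait_mean <- 1 / p_married
wait_sd   <- sqrt(1 - p_married) / p_married

round(rbind(mean = wait_mean, sd = wait_sd), 4)
```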

Q 5: (3.37) Exploring combinations. The formula for the number of ways to arrange n objects is n! = n x (n - 1) x … x 2 x 1. This exercise walks you through the derivation of this formula for a couple of special cases. A small company has five employees: Anna, Ben, Carl, Damian, and Eddy. There are five parking spots in a row at the company, none of which are assigned, and each day the employees pull into a random parking spot. That is, all possible orderings of the cars in the row of spots are equally likely.

  (a) On a given day, what is the probability that the employees park in alphabetical order?
  (b) If the alphabetical order has an equal chance of occurring relative to all other possible orderings, how many ways must there be to arrange the five cars?
  (c) Now consider a sample of 8 employees instead. How many possible ways are there to order these 8 employees’ cars?

A: (a) The probability that Anna takes the first parking spot is \(\frac{1}{5}\). Given that, the probability that Ben takes the second spot is \(\frac{1}{5 - 1}\), that Carl takes the third spot is \(\frac{1}{5 - 2}\), that Damian takes the fourth spot is \(\frac{1}{5 - 3}\), and that Eddy takes the fifth spot is \(\frac{1}{5 - 4}\). On a given day the probability that the employees park in alphabetical order is p = \(\left(\frac{1}{5}\right) * \left(\frac{1}{5 - 1}\right) * \left(\frac{1}{5 - 2}\right) * \left(\frac{1}{5 - 3}\right) * \left(\frac{1}{5 - 4}\right)\) = \(\frac{1}{120}\) = 0.0083333.

(b) All orderings are equally likely and the alphabetical one has probability \(\frac{1}{120}\), so there must be 5 factorial (5!) orderings of the 5 cars: 5! = 5 * 4 * 3 * 2 * 1 = 120.

(c) 8 cars can be arranged in 8 factorial (8!) orderings: 8! = 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1 = 40320
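
The counts for exercise 3.37 can be confirmed with `factorial()`, and the step-by-step product from part (a) with `prod()`. This is a small illustrative sketch; the variable names are not from the original.

```r
# (a) Sequential product 1/5 * 1/4 * 1/3 * 1/2 * 1/1.
p_alphabetical <- prod(1 / (5:1))

# (b), (c) Number of possible orderings of 5 and of 8 cars.
n_orderings_5 <- factorial(5)
n_orderings_8 <- factorial(8)

c(p_alphabetical = p_alphabetical,
  n_orderings_5 = n_orderings_5,
  n_orderings_8 = n_orderings_8)
```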

Q 6: (3.41) Sampling at school. For a sociology class project you are asked to conduct a survey on 20 students at your school. You decide to stand outside of your dorm’s cafeteria and conduct the survey on a random sample of 20 students leaving the cafeteria after dinner one evening. Your dorm is comprised of 45% males and 55% females.

  (a) Which probability model is most appropriate for calculating the probability that the 4th person you survey is the 2nd female? Explain.
  (b) Compute the probability from part (a).
  (c) The three possible scenarios that lead to 4th person you survey being the 2nd female are {M,M,F,F},{M,F,M,F},{F,M,M,F}. One common feature among these scenarios is that the last trial is always female. In the first three trials there are 2 males and 1 female. Use the binomial coefficient to confirm that there are 3 ways of ordering 2 males and 1 female.
  (d) Use the findings presented in part (c) to explain why the formula for the coefficient for the negative binomial is \(\binom{n-1}{k-1}\) while the formula for the binomial coefficient is \(\binom{n}{k}\).

A: (a) The most suitable probability model is the negative binomial distribution. It has the following features:

(1) The trials are independent. (2) Each trial outcome can be classified as a success or failure. (3) The probability of a success (p) is the same for each trial. (4) The last trial must be a success.

In our case the 4th person in the survey needs to be the 2nd female.

(b) Selecting a female is considered a success with probability (p) = 0.55, and the probability of failure = 1 - p = 1 - 0.55 = 0.45. The number of trials (n) = 4 and the number of successes (k) = 2: out of 4 students there should be 2 females, and the last one must be the 2nd female. Using the negative binomial distribution, P(the kth success on the nth trial) = \(\binom{n-1}{k-1}p^{k}(1-p)^{n-k} = \binom{4-1}{2-1} \times 0.55^{2} \times (1-0.55)^{4-2}\) = 0.1837688

# R's dnbinom(x, size, prob) takes x = number of failures (n - k = 2) and size = number of successes (k = 2):
# dnbinom(2, 2, 0.55) = 0.1837688, which matches the manual calculation.
# dnbinom(2, 4, 0.55) = 0.1853002 answers a different question: 2 failures before the 4th success.

(c) As described in the problem, there are 3 ways to order 2 males and 1 female. Selecting a female is considered a success. Number of students (n) = 3, successes (k) = 1, so the binomial coefficient = \(\binom{n}{k}\) = \(\frac{n!}{k! * \left(n - k\right)!}\) = \(\frac{3!}{1! * \left(3 - 1\right)!}\) = 3

(d) For the binomial coefficient there is no restriction on the order of successes and failures; they can happen in any sequence across all n trials, giving \(\binom{n}{k}\) orderings. In the negative binomial distribution the last trial is always reserved for a success: the kth success must fall on the nth trial. Only the remaining k - 1 successes can be arranged freely among the first n - 1 trials, so the coefficient is \(\binom{n-1}{k-1}\).
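
The difference between the two coefficients can be illustrated by enumerating the orderings for n = 4 and k = 2. This sketch uses base R's `combn()` to list the positions of the females; `to_sequence()` is a hypothetical helper introduced only for display.

```r
n <- 4  # number of students surveyed
k <- 2  # number of females (successes)

# All ways to place k females among n positions: choose(n, k) columns.
any_order <- combn(n, k)

# Keep only placements where the last (largest) position is n, i.e. the
# final student is female. combn() lists each column in increasing order,
# so row k holds the largest position.
ends_with_female <- any_order[, any_order[k, ] == n, drop = FALSE]

# Render each placement as a sequence such as "MMFF".
to_sequence <- function(pos) paste(ifelse(seq_len(n) %in% pos, "F", "M"), collapse = "")
sequences <- apply(ends_with_female, 2, to_sequence)

list(binomial_count = ncol(any_order),              # choose(4, 2)
     negative_binomial_count = ncol(ends_with_female),  # choose(3, 1)
     sequences = sequences)
```

The enumeration keeps exactly the three scenarios listed in part (c), while the unrestricted binomial count also includes orderings that end with a male.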