Data 606 Assignment 3 - Distributions

3.2 Area under the curve (Use normalPlot)
3.4 Triathlon times, Part I.
3.18 Heights of female college students (Use qqnormsim from lab 3)
3.22 Defective rate.
3.38 Male children
3.42 Serving in volleyball

This document presents homework Assignment 3 for Data 606.

3.2 Area under the curve (Use normalPlot)

We are asked to find what percent of a standard normal distribution lies in each region below and to draw a graph.

Note that we can use -Inf and Inf to plug into the normal CDF to calculate tail regions for the pnorm and normalPlot functions below.

We also express probabilities in decimal format rather than percentage format, so 0.5 means a 50% probability.

pnorm(-Inf)  # effectively equals 0

## [1] 0

pnorm(Inf)   # effectively equals 1

## [1] 1

\(Z > -1.13\) has probability \[Pr[ Z > -1.13] = 0.8707619\].

normalPlot(bounds=c(-1.13,Inf))

\(Z < 0.18\) has probability \[Pr[ Z < 0.18 ] = 0.5714237 \].

normalPlot(bounds=c(-Inf, 0.18))

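\(Z > 8\) has probability \[Pr[ Z > 8 ] \approx 6.22\times 10^{-16} \], which is effectively zero. (Evaluating 1 - pnorm(8) instead returns about 6.66e-16 because of floating-point rounding near 1.)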

normalPlot(bounds=c(8,Inf))

\(|Z| < 0.5\) has probability \[ Pr[|Z| < 0.5 ] = 0.3829249\].

normalPlot(bounds=c(-0.5, 0.5 ) )
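The four areas quoted above can be reproduced with base R's pnorm alone. The following is a minimal sketch, with illustrative variable names that are not part of the assignment. It uses lower.tail = FALSE to compute upper tails directly instead of subtracting from 1, which matters numerically for far tails such as Z > 8.

```r
# Areas under the standard normal curve, computed with base R's pnorm().
p_a <- pnorm(-1.13, lower.tail = FALSE)  # Pr[Z > -1.13]
p_b <- pnorm(0.18)                       # Pr[Z < 0.18]
p_c <- pnorm(8, lower.tail = FALSE)      # Pr[Z > 8], upper tail computed directly
p_d <- pnorm(0.5) - pnorm(-0.5)          # Pr[|Z| < 0.5]

print(c(a = p_a, b = p_b, c = p_c, d = p_d))
```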

3.4 Triathlon times, Part I.

The shorthand for the two distributions of finishing times is \[ N(\mu = 4313, \sigma = 583) \] for Leo’s group and \[ N(\mu = 5261, \sigma = 807) \] for Mary’s group.
The Z-scores for Leo’s and Mary’s times are calculated below:

(zLeo = (4948 - 4313) / 583.0 )

## [1] 1.089194

(zMary = (5513 - 5261 ) / 807 )

## [1] 0.3122677

Leo’s Z-score is 1.089 and Mary’s Z-score is 0.312.

Both racers were slower than the average of their respective groups because a positive Z-score denotes a longer-than-average finishing time.

Mary ranks better than Leo within her respective group because her Z-score is lower than Leo’s. A lower Z-score means a faster finishing time.
Leo finished faster than 13.80% of all male runners in his group, based on the calculation shown below.

(leo_fraction_faster = 1 - pnorm( zLeo ) )

## [1] 0.1380342

Mary finished faster than 37.74% of all female runners in her group, based on the calculation shown below.

(mary_fraction_faster = 1 - pnorm(zMary))

## [1] 0.3774186
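As a cross-check, the same tail probabilities can be obtained without computing Z-scores first, by passing each group's mean and standard deviation to pnorm directly. This is a minimal sketch; the variable names are illustrative and not part of the assignment.

```r
# Fraction of each group with a longer (slower) finishing time than the racer.
# lower.tail = FALSE returns Pr[X > x] for X ~ N(mean, sd).
leo_check  <- pnorm(4948, mean = 4313, sd = 583, lower.tail = FALSE)
mary_check <- pnorm(5513, mean = 5261, sd = 807, lower.tail = FALSE)

print(round(c(leo = leo_check, mary = mary_check), 4))
```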

If the distributions of finishing times are not nearly normal, some of the answers to parts b-e would change while others would not.

Part b would not change. It asks for Z-scores, which are calculated the same way regardless of the true probability distribution. However, one could no longer infer a probability from a Z-score because the shape of the distribution is unknown. Thus, the answers to parts c, d, and e could change.

3.18 Heights of female college students (Use qqnormsim from lab 3)

We first load the heights into a vector for analysis.

fheights = as.numeric( c(54,55,56,56,57,58,58,59,60,60,60,61,61,62,62,63,63,63,64,65,65,67,67,69,73 ) )

Next we validate that the count, mean, and standard deviation match the values given in the textbook. They do.

length(fheights)

## [1] 25

mean(fheights)

## [1] 61.52

sd(fheights)

## [1] 4.583667

To test the 68-95-99.7 rule, we first convert all raw heights to Z-scores.

(Z = (fheights - mean(fheights) ) / sd(fheights) )

##  [1] -1.6406080 -1.4224420 -1.2042761 -1.2042761 -0.9861101 -0.7679442
##  [7] -0.7679442 -0.5497782 -0.3316122 -0.3316122 -0.3316122 -0.1134463
## [13] -0.1134463  0.1047197  0.1047197  0.3228856  0.3228856  0.3228856
## [19]  0.5410516  0.7592175  0.7592175  1.1955494  1.1955494  1.6318813
## [25]  2.5045451

P1 = length( subset( Z,   Z <= 1 & Z >= -1 ) ) / length( Z )
P2 = length( subset( Z,   Z <= 2 & Z >= -2 ) ) / length( Z )
P3 = length( subset( Z,   Z <= 3 & Z >= -3 ) ) / length( Z )

df = data.frame( stdev = c( 1, 2, 3), empirical = c( P1, P2, P3), normal = c( .68, .95, .997 ) )

knitr::kable(df, digits = 4, caption="Comparing Females Heights Distribution with 1-3 SDs")

Comparing Females Heights Distribution with 1-3 SDs
stdev	empirical	normal
1	0.68	0.680
2	0.96	0.950
3	1.00	0.997

Looking at the table, we conclude that the 68-95-99.7 rule fits the heights data very well.

The histogram and QQ-plot suggest that the data follow a normal distribution closely. The histogram follows the bell curve (unimodal, symmetric). The QQ-plot shows few outliers at either tail.
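The plots referred to here can be produced with base R graphics. The sketch below is one possible way to do it and is not necessarily how the original figures were made. It uses hist, qqnorm, and qqline rather than the course's qqnormsim helper, and it mimics the simulation idea by drawing a few normal samples of the same size. The seed and panel layout are arbitrary choices.

```r
# Heights (inches) of the 25 female college students, as entered above.
fheights <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61, 62, 62,
              63, 63, 63, 64, 65, 65, 67, 67, 69, 73)

# Histogram on the density scale with the fitted normal curve overlaid.
hist(fheights, freq = FALSE, main = "Heights of female college students",
     xlab = "Height (inches)")
curve(dnorm(x, mean = mean(fheights), sd = sd(fheights)), add = TRUE)

# Normal QQ-plot of the observed data with a reference line.
qq <- qqnorm(fheights, main = "Observed heights")
qqline(fheights)

# QQ-plots of simulated normal samples of the same size, for comparison.
set.seed(606)  # arbitrary seed, for reproducibility only
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  sim <- rnorm(length(fheights), mean = mean(fheights), sd = sd(fheights))
  qqnorm(sim, main = paste("Simulated sample", i))
  qqline(sim)
}
par(op)
```

If the observed QQ-plot looks no more irregular than the simulated panels, that supports treating the heights as approximately normal.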


3.22 Defective rate.

We use the formula and methods for geometric distribution explained on page 143. Defect rate is \(p = 0.02\) for a machine producing transistors.

p = 0.02

The probability that the first defect is the 10th transistor is \[ Pr[\text{1st defect at 10th transistor}] = p (1-p)^9 = 0.016675 \]
The probability of no defects in a batch of 100 is:

\[ Pr[\text{no defects in 100 transistors}] = (1-p)^{100} = 0.1326196 \] While a single transistor has a low probability of defect, the impact of compounding means that some defects are common in large batches.

We expect on average \(1/p\) trials to observe a defect. This is 50 trials.
The standard deviation is \[ \sigma = \sqrt{ \frac{1-p}{p^2} } = \sqrt{ \frac{ 0.98}{(0.02)^2} } = 49.4974747 \text{ trials }\]
If another machine has a defect rate \(q = 0.05\) we are asked to repeat the calculations for average trials to first defect and the associated standard deviation. This gives:

q = 0.05

\[ \text{ Expected trials to first defect} = \frac{1}{q} = \frac{1}{0.05} = 20 \text{ trials} \]

\[ \text{ Standard deviation of trials to first defect} = \sqrt{\frac{1-q}{q^2} } = \sqrt{\frac{ 0.95}{0.05^{2}}} = 19.4935887 \text{ trials} \]
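The geometric-distribution figures above can be checked in R. This is a minimal sketch; geom_summary is a hypothetical helper introduced here for illustration, and dgeom(9, p) is used because R's dgeom counts the failures before the first success.

```r
# Mean and standard deviation of the number of trials up to the first defect.
geom_summary <- function(p) {
  c(mean = 1 / p, sd = sqrt((1 - p) / p^2))
}

p <- 0.02
first_defect_at_10 <- dgeom(9, prob = p)  # 9 good transistors, then a defect
no_defects_in_100  <- (1 - p)^100         # 100 good transistors in a row

print(c(first_defect_at_10 = first_defect_at_10,
        no_defects_in_100 = no_defects_in_100))
print(geom_summary(0.02))  # first machine
print(geom_summary(0.05))  # second machine
```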

Increasing the probability of a “success” (a defect, in the case of transistors) reduces the expected number of trials to the first defect and reduces the uncertainty of the wait time. In the extreme case where the probability of a defect goes to 1, the first transistor is always defective and there is no uncertainty in the wait time. Thus, in the limit, the wait time goes to 1 trial and the standard deviation goes to zero.

A more mathematical approach is to use the first derivatives of the mean and standard deviation to measure the marginal impact:

\[ \text{marginal sensitivity of wait time} = \frac{d}{dp}\left( \frac{1}{p}\right) = -\frac{1}{p^2} \]

\[ \text{marginal sensitivity of stdev} = \frac{d\sigma}{dp} = \frac{d}{dp}\left( \sqrt{ \frac{1-p}{p^2} } \right) = \frac{d}{dp}\left( (1-p)^{1/2} p^{-1} \right) \] \[ = -\frac{1}{2}(1-p)^{-1/2}p^{-1} - (1-p)^{1/2}p^{-2} \] \[ = -p^{-2}(1-p)^{-1/2}\left( \frac{p}{2} + (1-p) \right) \] \[ \frac{d\sigma}{dp} = -\frac{2-p}{2 p^2 \sqrt{1-p} } < 0 \text{ for all } 0<p<1 \]

We conclude that the first derivatives of both quantities are negative for all probabilities \(0<p<1\). Thus, increasing the defect rate always decreases the expected wait time and its standard deviation.
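The sign claim can also be checked numerically. The sketch below approximates both derivatives on a grid of p values, assuming a central finite difference with an arbitrary step size h, and confirms that they are negative. The helper names are illustrative, not from the assignment.

```r
wait_mean <- function(p) 1 / p          # expected trials to first defect
wait_sd   <- function(p) sqrt(1 - p) / p  # standard deviation of the wait

# Central finite-difference approximation to f'(p).
num_deriv <- function(f, p, h = 1e-6) (f(p + h) - f(p - h)) / (2 * h)

p_grid <- seq(0.01, 0.99, by = 0.01)
d_mean <- num_deriv(wait_mean, p_grid)
d_sd   <- num_deriv(wait_sd, p_grid)

print(all(d_mean < 0))  # TRUE if the mean wait decreases everywhere on the grid
print(all(d_sd < 0))    # TRUE if the standard deviation decreases everywhere
```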

3.38 Male children

The probability of two boys in three children using the binomial model where probability of one boy \(p=0.51\) is:

\[ Pr[\text{2 boys in 3 kids}] = \binom{3}{2}(0.51)^2 (.49) = 0.382347 \]

The possible orderings of all 2 boys in 3 children are: BBG, BGB, GBB. The probability of each individual scenario is (respectively): \[ Pr[BBG] = (0.51) (0.51) (0.49) = 0.127449 \] \[ Pr[BGB] = (0.51) (0.49) (0.51) = 0.127449 \] \[ Pr[GBB] = (0.49) (0.51) (0.51) = 0.127449 \]

The sum of the identical probabilities of these 3 scenarios is:

\[ 3 \times (0.51)^2 (0.49) = 3 \times 0.127449 = 0.382347 \] This confirms that parts a and b give equivalent answers.

The approach in part (b) does not scale because enumerating every ordering quickly becomes impractical, even for moderate values of \(n\) and \(k\). In the case of \(n=8\) and \(k=3\), there are

\[\binom{8}{3}= 56\] combinations to enumerate.
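Both approaches can be compared in R. The following sketch, with illustrative variable names, computes the binomial probability with dbinom and then enumerates all birth orders with expand.grid to show that the sum over the orderings with exactly two boys gives the same value.

```r
p_boy <- 0.51

# Part (a): binomial probability of exactly 2 boys among 3 children.
binom_prob <- dbinom(2, size = 3, prob = p_boy)

# Part (b): enumerate all 2^3 birth orders and keep those with exactly 2 boys.
orders <- expand.grid(first = c("B", "G"), second = c("B", "G"),
                      third = c("B", "G"), stringsAsFactors = FALSE)
n_boys <- rowSums(orders == "B")
order_prob <- p_boy^n_boys * (1 - p_boy)^(3 - n_boys)
enum_prob <- sum(order_prob[n_boys == 2])

print(c(binomial = binom_prob, enumerated = enum_prob))
n_orderings <- choose(8, 3)  # orderings to list by hand for n = 8, k = 3
print(n_orderings)
```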

3.42 Serving in volleyball

The probability that a volleyball player makes her 3rd successful serve on her 10th attempt can be calculated using the negative binomial distribution. Applying the formula on page 154 with \(n=10\), \(k=3\), \(p=0.15\) gives the value:

n=10
k=3
p=0.15
( answer = choose(n-1,k-1) * p^k * (1-p)^(n-k) )

## [1] 0.03895012

\[Prob[\text{3rd success on 10th attempt} ] = \binom{n-1}{k-1}p^{k}(1-p)^{n-k}= 0.0389501\]
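The hand-coded formula can be cross-checked against R's built-in dnbinom. This is a small sketch with illustrative variable names. Note that dnbinom is parameterized by the number of failures before the k-th success, so the first argument is n - k rather than n.

```r
n <- 10
k <- 3
p <- 0.15

by_formula <- choose(n - 1, k - 1) * p^k * (1 - p)^(n - k)
by_dnbinom <- dnbinom(n - k, size = k, prob = p)  # n - k failures before the k-th success

print(c(formula = by_formula, dnbinom = by_dnbinom))
```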

If she has already made two successful serves in nine attempts, the probability that her 10th serve is successful remains unchanged at 0.15 because the 10th serve is independent.
There is no contradiction between parts a and b because they calculate probabilities of different events. Part a asks for the unconditional probability, before any serves have been made, that the 3rd success occurs on exactly the 10th attempt. Part b asks for the conditional probability that a single future serve succeeds, given that the first 9 attempts produced exactly 2 successes.