Data 606_Chapter 3

3.2. Area under the curve, Part II.

Z > -1.13

A31a <- round((1 - pnorm(-1.13)), 4)
A31a

## [1] 0.8708

Z < .18

A31b <- round(pnorm(.18), 4)
A31b

## [1] 0.5714

Z > 8

A31c <- 1 - pnorm(8) # Not rounding because extremely small value.
A31c

## [1] 6.661338e-16
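
For a tail this small, subtracting from 1 loses precision, because pnorm(8) is within floating-point rounding of 1. A minimal sketch, assuming only base R's pnorm() and its lower.tail argument, computes the upper tail directly and sets it beside the subtraction; the variable names are illustrative.

```r
# Upper-tail area for Z > 8 computed two ways (variable names are illustrative).
tail.subtract <- 1 - pnorm(8)                 # limited by floating-point cancellation
tail.direct <- pnorm(8, lower.tail = FALSE)   # upper tail computed directly
c(subtract = tail.subtract, direct = tail.direct)
```

The two values agree in order of magnitude, but only the direct form keeps its significant digits.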

|Z| < .5

A31d <- round(pnorm(.5) - pnorm(-.5), 4) # area between -0.5 and 0.5
A31d

## [1] 0.3829
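
The condition |Z| < .5 describes the central region between -0.5 and 0.5, and the two tails beyond those points are its complement. A short sketch of that split, assuming only base R's pnorm(); the names center and tails are illustrative.

```r
# Central area and two-tail area for a cutoff of 0.5 (names are illustrative).
center <- pnorm(.5) - pnorm(-.5)   # P(-0.5 < Z < 0.5)
tails <- 2 * pnorm(-.5)            # P(Z < -0.5) + P(Z > 0.5), by symmetry
round(c(center = center, tails = tails, total = center + tails), 4)
```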

3.4. Triathlon times, Part I.

Short-hand for the normal distributions: M30_34 ~ N(mu = 4313, sd = 583), W25_29 ~ N(mu = 5261, sd = 807)
Z-scores for Leo and Mary, and what they tell us

Leo <- 4948
Mary <- 5513
m.avg <- 4313
m.sd <- 583
w.avg <- 5261
w.sd <- 807
Leo.z <- (Leo - m.avg) / m.sd
Mary.z <- (Mary - w.avg) / w.sd
Leo.ptile <- round(pnorm(Leo.z), 4)
Mary.ptile <- round(pnorm(Mary.z), 4)
cat("Leo's Z-score is", round(Leo.z, 3), "and his percentile is", Leo.ptile, "\nMary's Z-score is ", round(Mary.z, 3), "and her percentile is", Mary.ptile)

## Leo's Z-score is 1.089 and his percentile is 0.862 
## Mary's Z-score is  0.312 and her percentile is 0.6226

Did Leo or Mary rank better in their respective groups? Explain reasoning.

Relative to their respective groups, Mary ranked better than Leo. A lower finishing time is better, and both Z-scores are positive, so both were slower than their group's mean. Leo’s z-score of ~1.09 puts his time above about 86% of the M30-34 division, so he beat only about 14% of it. Mary’s z-score of ~.31 puts her time above about 62% of the W25-29 division, so she beat about 38% of it.

What % of triathletes did Leo finish faster than in his group?

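round(1 - Leo.ptile, 4) # faster finishers have lower times, so take the upper tail

## [1] 0.138

What % of triathletes did Mary finish faster than in her group?

round(1 - Mary.ptile, 4) # upper tail, for the same reason

## [1] 0.3774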
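
Because a faster finish means a smaller time, the share of a group that an athlete beat is the upper tail of that group's distribution. A sketch that computes it in one step from the raw times, assuming the normal models stated above and base R's pnorm() with lower.tail = FALSE; share.beaten is a hypothetical helper, not part of the original solution.

```r
# Hypothetical helper: fraction of the group with a slower (larger) time,
# assuming finishing times in the group are normally distributed.
share.beaten <- function(time, avg, sdev) {
  pnorm(time, mean = avg, sd = sdev, lower.tail = FALSE)
}
round(share.beaten(4948, avg = 4313, sdev = 583), 4)  # Leo, M30-34
round(share.beaten(5513, avg = 5261, sdev = 807), 4)  # Mary, W25-29
```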

If the distributions of finishing times are not nearly normal, how would the answers to (b)-(e) change? Explain reasoning.

The Z-scores in (b) would not change, since a Z-score can be computed for any distribution from its mean and standard deviation. What changes is their interpretation: without approximate normality we cannot convert a Z-score into a percentile with pnorm, so the answers to (d) and (e) would no longer hold, and the ranking in (c) could not be settled from Z-scores alone. We would need more information about the shape of each distribution to compare Leo’s and Mary’s performance within their divisions.

3.18. Heights of female college students

Mean height is 61.52in, SD of 4.58in. Do heights approximately follow 68-95-99.7 rule?

height <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61, 62, 62, 63, 63, 63, 64, 65, 65, 67, 67, 69, 73)
# quick checks for input error 
students.total <- length(height) # Should be 25 students
height.avg <- mean(height) # Should be mean of 61.52 
height.sd <- sd(height) # Should be SD of 4.58
cat(students.total, height.avg, height.sd)

## 25 61.52 4.583667

# Let's see how many of the sample fall within one SD
sd1.below <- (height.avg - height.sd)
sd1.above <- (height.avg + height.sd)
sd1.exp <- round(students.total * .68)
sd1.act <- sum(height > sd1.below & height < sd1.above)
sd2.below <- (height.avg - (height.sd * 2))
sd2.above <- (height.avg + (height.sd * 2))
sd2.exp <- round(students.total * .95)
sd2.act <- sum(height > sd2.below & height < sd2.above)
sd3.below <- (height.avg - (height.sd * 3))
sd3.above <- (height.avg + (height.sd * 3))
sd3.exp <- round(students.total * .997)
sd3.act <- sum(height > sd3.below & height < sd3.above)
cat("SD1 Sample vs. Expected:", sd1.act, "/", sd1.exp, "\nSD2 Sample vs. Expected:", sd2.act, "/", sd2.exp, "\nSD3 Sample vs. Expected:", sd3.act, "/", sd3.exp)

## SD1 Sample vs. Expected: 17 / 17 
## SD2 Sample vs. Expected: 24 / 24 
## SD3 Sample vs. Expected: 25 / 25
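
The three checks above repeat one calculation with a different multiplier each time. A compact sketch of the same comparison as proportions, assuming the same height data; within.k is a hypothetical helper introduced only for illustration.

```r
height <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61, 62, 62, 63,
            63, 63, 64, 65, 65, 67, 67, 69, 73)
# Hypothetical helper: proportion of observations strictly within k SDs of the mean.
within.k <- function(x, k) mean(abs(x - mean(x)) < k * sd(x))
observed <- sapply(1:3, function(k) within.k(height, k))
rule <- c(0.68, 0.95, 0.997)
round(rbind(observed, rule), 3)
```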

Do these data appear to follow normal distribution? Explain reasoning using graphs provided below.

# Replicate the histogram and QQplot in the text to check if the distribution appears normal.
par(mfrow=c(1,2))
hist(height, breaks = 8, xlab = NULL, main = paste("Student heights", "\nBins = 8"))
qqnorm(height, pch = 2, frame = F, main = paste("Student heights", "\nQQplot"))
qqline(height, col = "red", lwd = 2)

These data do appear to follow the normal distribution. I detected some right-side skew when expanding the number of bins, but there does not seem to be dramatic departure from the line in the QQplot.

3.21. (Practice) Married women

married <- .471
unmarried <- (1 - married)

Three women between these ages are randomly selected. What is the probability that the third woman selected is the only one who is married?

thirdwoman <- (unmarried) * (unmarried) * married
A321a <- round(thirdwoman, 4)
A321a

## [1] 0.1318
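
The product above is the geometric probability for a first success on the third trial. As a cross-check, assuming base R's dgeom(), whose first argument counts the failures before the first success and not the number of trials:

```r
married <- .471
# Two unmarried women (failures) precede the first married woman (success).
check <- dgeom(2, prob = married)
round(check, 4)
```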

What is probability that all three are married?

allwomen <- (married)^3
A321b <- round(allwomen, 4)
A321b

## [1] 0.1045

On average, how many women would you expect to sample before selecting a married woman? What is standard deviation?

# Each selection is a separate, independent trial, so the number of women sampled up to and including the first married woman follows a geometric distribution: P(first success on trial n) = (1 - p)^(n-1) * p, with mu = 1 / p and sd = sqrt((1 - p) / p^2).
married.avg <-  1 / married
married.sd <- sqrt((unmarried) / married^2)
cat("We'd expect to sample around", round(married.avg), "women\nThe standard deviation is", round(married.sd, 2))

## We'd expect to sample around 2 women
## The standard deviation is 1.54

If proportion of married women was actually 30%, how many women would you expect to sample before selecting a married woman? What is SD?

married.alt <- .3
unmarried.alt <- 1 - married.alt
married.avg.alt <- 1 / married.alt
married.sd.alt <- sqrt((unmarried.alt) / married.alt^2)
cat("We'd expect to sample around", round(married.avg.alt), "women\nThe standard deviation is", round(married.sd.alt, 2))

## We'd expect to sample around 3 women
## The standard deviation is 2.79

Based on answers to parts (c) and (d), how does decreasing the probability of an event affect the mean and SD of the wait time until success?

Decreasing the probability increases the number of women (as approximated by the mean) we’d expect to sample before selecting a married woman. The standard deviation also grows as the probability falls, so the wait is both longer and more variable.
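
The pattern can be checked for both proportions at once. A minimal sketch, assuming the geometric-distribution formulas quoted in the code comments above; geom.summary is a hypothetical helper, not part of the original solution.

```r
# Hypothetical helper: mean and SD of the number of trials up to and including
# the first success, for success probability p (geometric distribution).
geom.summary <- function(p) c(p = p, mean = 1 / p, sd = sqrt((1 - p) / p^2))
round(t(sapply(c(.471, .3), geom.summary)), 2)
```

Each row of the result holds one probability with its mean and SD, so the two cases can be read side by side.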

3.22. Defective rate

What is probability 10th transistor produced is first with defect?

# As this is a random process, each transistor represents a separate, independent trial. We use the geometric distribution for the probability that the 10th transistor, and none before it, is defective: P(first success on trial n) = (1 - p)^(n-1) * p, with mu = 1 / p and sd = sqrt((1 - p) / p^2).
defect <- .02
nominal <- 1 - defect
n1 <- 10
A322a <- round((nominal)^(n1 - 1) * defect, 4)
A322a

## [1] 0.0167

What is probability that the machine produces no defective transistors in batch of 100?

# No defects in a batch of 100 means 100 independent non-defective transistors in a row.
n2 <- 100
A322b <- round(nominal^n2, 4)
A322b

## [1] 0.1326
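
The event "no defective transistors in a batch of 100" can also be read as zero successes in 100 independent binomial trials. A brief sketch of that reading, assuming base R's dbinom(); the names batch and no.defects are illustrative.

```r
defect <- .02
batch <- 100
# Zero defects among `batch` independent transistors, each defective with
# probability `defect`; this equals (1 - defect)^batch.
no.defects <- dbinom(0, size = batch, prob = defect)
round(no.defects, 4)
```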

On average, how many transistors produced before first with defect? SD?

defect.avg <-  1 / defect
defect.sd <- sqrt((nominal) / defect^2)
cat("We'd expect to sample around", round(defect.avg), "transistors\nThe standard deviation is", round(defect.sd, 2))

## We'd expect to sample around 50 transistors
## The standard deviation is 49.5

Another machine has 5% defective rate. On average how many transistors produced before first with defect? SD?

defect.alt <- .05
nominal.alt <- 1 - defect.alt
defect.avg.alt <- 1 / defect.alt
defect.sd.alt <- sqrt((nominal.alt) / defect.alt^2)
cat("We'd expect to sample around", round(defect.avg.alt), "transistors\nThe standard deviation is", round(defect.sd.alt, 2))

## We'd expect to sample around 20 transistors
## The standard deviation is 19.49

Based on answers to parts (c) and (d), how does increasing the probability of an event affect the mean and SD of the wait time until success?

Increasing the probability of a defect (the success, in this case) led to a lower mean and a lower standard deviation. This means fewer units need to be produced, on average, before a defect is detected.

3.37. (Practice) Exploring permutations

On given data, what is probability that employees park in alphabetical order?

# Five employees can fill five spots in n! equally likely orders, where n is 5. Exactly one of those orders is alphabetical, so the probability is 1 / n!.
spots <- 5
A337a <- round(1 / factorial(spots), 4)
A337a

## [1] 0.0083

If alphabetical order has an equal chance of occurring relative to all other possible orderings, how many ways must there be to arrange the five cars?

A337b <- factorial(spots)
A337b

## [1] 120

If used sample of 8 employees, how many possible ways to order their cars?

spots.alt <- 8
A337c <- factorial(spots.alt) # already a whole number, so no rounding is needed
A337c

## [1] 40320

3.38. Male children

boy <- .51
girl <- 1 - boy
kids <- 3

Use binomial model to calculate probability that two will be boys

# The probability of having exactly two boys is provided by dbinom, with n of 3 (children), x of 2 (boys), and probability of .51.
A338a <- round(dbinom(2, size = kids, prob = boy), 4)
A338a

## [1] 0.3823

Write out all possible orderings of 3 children, two of whom are boys

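# List every boy/girl sequence for the three children with permutations() from the gtools package. The orderings with exactly two boys are rows 2, 3, and 5 of the output.
library(gtools)
x <- c("boy", "girl")
genders <- length(x)
permutations(n = genders, r = kids, v = x, repeats.allowed = TRUE)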

##      [,1]   [,2]   [,3]  
## [1,] "boy"  "boy"  "boy" 
## [2,] "boy"  "boy"  "girl"
## [3,] "boy"  "girl" "boy" 
## [4,] "boy"  "girl" "girl"
## [5,] "girl" "boy"  "boy" 
## [6,] "girl" "boy"  "girl"
## [7,] "girl" "girl" "boy" 
## [8,] "girl" "girl" "girl"

Use scenarios to calculate same probability from part (a) but use addition rule for disjoint outcomes. Confirm match between (a) and (b).

added <- (boy * boy * girl) + (boy * girl * boy) + (girl * boy * boy)
A338b <- round(added, 4)
A338b

## [1] 0.3823

If we wanted to calculate the probability that a couple who plans to have 8 kids will have 3 boys, why would the (b) approach be more tedious than the (a) approach?

The (b) approach requires writing out and summing every ordering of 3 boys among 8 children, and there are choose(8, 3) = 56 of them. The (a) approach gets the same result from a single dbinom call - see below:

kids.alt <- 8
A338c <- round(dbinom(3, size = kids.alt, prob = boy), 4)
A338c

## [1] 0.2098

# vs. A338c.alt <- round((boy * boy * boy * girl * girl * girl * girl * girl)...
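
Every ordering of 3 boys and 5 girls has the same probability, so the sum over orderings collapses to a count times one product. A short sketch of that shortcut, assuming base R's choose() and dbinom(); the names orderings and by.hand are illustrative.

```r
boy <- .51
girl <- 1 - boy
# choose(8, 3) counts the orderings with exactly 3 boys among 8 children.
orderings <- choose(8, 3)
# Each such ordering has probability boy^3 * girl^5.
by.hand <- orderings * boy^3 * girl^5
c(orderings = orderings,
  by.hand = round(by.hand, 4),
  dbinom = round(dbinom(3, size = 8, prob = boy), 4))
```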

3.42. Serving in volleyball.

There is a 15% chance of making each volleyball serve, and serves are independent events.

serve <- .15
miss <- 1 - serve

What is the probability that the 10th try is the 3rd successful serve?

# We use the negative binomial distribution for the probability that the 3rd success falls on the 10th trial. dnbinom() takes the number of failures before the target success (10 - 3 = 7), not the number of trials.
x <- 10
success <- 3
A342a <- round(dnbinom(x - success, size = success, prob = serve), 4)
A342a

## [1] 0.039

Suppose 2 successes in 9 attempts. What is the probability that the 10th serve is a success?

The serves are independent events, so each serve has a 15% chance of success. The 10th is no different from any other.

The probabilities for (a) and (b) should be different. Explain the reason for the discrepancy.

The probability in (a) is a joint probability: exactly 2 successes in the first 9 serves, followed by a success on the 10th. The probability in (b) takes the first 9 serves as already observed, so it concerns only a single independent serve that happens to come after a series of other independent serves.
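
The difference between the two questions can be made concrete by building the joint probability from its two independent pieces. A minimal sketch, assuming base R's dbinom() and dnbinom(); the names first.nine, joint, and conditional are illustrative.

```r
serve <- .15
# Joint event: exactly 2 successes in the first 9 serves, then a success on the 10th.
first.nine <- dbinom(2, size = 9, prob = serve)
joint <- first.nine * serve
# Conditional event: the 10th serve alone, whatever happened on the first 9.
conditional <- serve
c(joint = round(joint, 4), conditional = conditional)
```

The joint value matches dnbinom(7, size = 3, prob = serve), since 7 failures precede the 3rd success when it lands on the 10th serve.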

Data 606_Chapter 3_Homework

Jeremy O’Brien

February 20, 2018