Distributions

(This file is still a work in progress.)

All material here is based on the free, excellent statistics textbook at openintro.org. These are largely my notes to serve as an aide-memoire. I make no guarantees that there are no mistakes or typos, though I certainly have a motivation to avoid tripping myself up in the future with careless mistakes.

The exercises are from chapter 3 of the same text. If you intend to use R for statistical analysis, becoming familiar, nay agile, with the functions for all of the common probability distributions is a prerequisite. The questions in these exercises should lead immediately to the proper function and the parameters for that function. When you are sure-footed with that, you’re ready to move on.

Distribution Functions

Each distribution has functions for density, distribution, quantile and random generation.

Probability Density Function (PDF)

The density function (commonly known as a Probability Density Function or PDF) gives the height of the distribution curve at an exact point. The distribution function gives the cumulative probability up to a point; the details differ between continuous and discrete distributions. The quantile function goes the other way: given a cumulative probability, it returns the value below which that proportion of the distribution lies (for the standard normal, that value is in standard deviations).

In R, the PDF name starts with “d”: dnorm, dbinom, dunif, etc.

Here is a simple example: a plot of the PDF for the standard normal distribution, so the mean is 0 and the standard deviation is 1. The x values range from -3 to +3 standard deviations.

set.seed(1)
data <- dnorm(seq(-3, 3, 0.05))
x <- seq(-3, 3, 0.05)
plot(x, data, type="l", xlab="S.D.s from Mean", ylab="Prob.")

plot of chunk unnamed-chunk-1

This shows that the height of the curve (the density) at a value of 0 is about 0.39. We can get the density at specific points using the PDF, which is dnorm.

dnorm(x=0)

## [1] 0.3989

A PDF should always have a total area of 1 under the curve. This is true using dnorm for any standard deviation. Note that the height of the curve is a density, not a probability, and it changes with the scale of x. For example, here the standard deviation is 10, so the curve is ten times wider and one tenth as tall.

data <- dnorm(seq(-30, 30, 0.05), mean=0, sd=10)
x <- seq(-30, 30, 0.05)
plot(x, data, type="l", xlab="S.D.s from mean X S.D.", ylab="Prob./S.D.")

plot of chunk unnamed-chunk-3

Now the x scale must go from -30 to 30 to be equivalent to the +/- 3 standard deviations shown for the standard normal distribution.
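One way to check the claim about total area numerically is with base R's integrate(). This is only a sketch; the variable names are mine.

```r
# Total area under the normal density is 1, whatever the standard deviation.
area.sd1 <- integrate(dnorm, lower = -Inf, upper = Inf)$value
area.sd10 <- integrate(function(x) dnorm(x, mean = 0, sd = 10),
                       lower = -Inf, upper = Inf)$value

# Probabilities come from areas over an interval, not from the curve height.
# Area between -1 and 1 for the standard normal, computed two ways:
area.mid <- integrate(dnorm, lower = -1, upper = 1)$value
cdf.mid <- pnorm(1) - pnorm(-1)

print(c(area.sd1, area.sd10, area.mid, cdf.mid))
```

The first two values should both be very close to 1, and the last two should agree with each other.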

Cumulative Distribution Function (CDF)

The cumulative distribution function is also commonly known just as the distribution function. It returns the area under the curve of the PDF to the left of a given point.

In R, the CDF name begins with “p”.

For example:

data <- dnorm(seq(-3, 3, 0.05))
x <- seq(-3, 3, 0.05)
plot(x, data, type="l", xlab="S.D.s from Mean", ylab="Prob.")

abline(v=1, col="red")

plot of chunk unnamed-chunk-4

dnorm(x=1)

## [1] 0.242

pnorm(q=1)

## [1] 0.8413

The red line intersects the curve at one standard deviation above the mean of zero. pnorm(q=1) returns the area under the curve for the standard normal distribution to the left of this point. So about 84% of the area under the curve lies to the left of the red line. And dnorm(x=1) shows that the height of the curve at one standard deviation above the mean is 0.242. That height is a density, not the probability of landing exactly there.

If we want to know the area to the right of the red line:

pnorm(q=1, lower.tail=FALSE)

## [1] 0.1587

Here is a plot of the values of the CDF for the standard normal distribution.

data <- seq(-3, 3, 0.01)
plot(data, pnorm(data), type="l")
abline(v=0, col="red")

plot of chunk unnamed-chunk-6

The red line is at the mean, and so this indicates, as expected, that half of the area under the PDF curve is below the mean, and half above. It also demonstrates that, below -2 standard deviations and above 2, there is very little area under the PDF curve. This corresponds to the common use of the 95% confidence interval, which covers the area of the PDF from -1.96 to +1.96 standard deviations.
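To put numbers on how much area lies near the mean, the CDF can be differenced. A small sketch, with central.area being a helper name I made up:

```r
# Area of the standard normal PDF within k standard deviations of the mean.
central.area <- function(k) pnorm(k) - pnorm(-k)

print(central.area(c(1, 1.96, 2, 3)))
```

The value for 1.96 should come out at about 0.95, and the values for 1, 2 and 3 give the familiar 68-95-99.7 rule.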

Quantile Function

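The quantile function is the inverse of the CDF (not of the PDF): given a cumulative probability p, it returns the value with that proportion of the distribution to its left. In R, the function name begins with “q”: qnorm, qbinom, qunif, etc. Expand.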
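A quick sketch of qnorm undoing pnorm. The mean of 100 and sd of 15 in the last step are made-up values, there only to show the units of the result.

```r
# qnorm() takes a cumulative probability and returns the matching value.
p <- pnorm(1)        # area to the left of 1, about 0.8413
x.back <- qnorm(p)   # recovers the starting point, 1

# The value with 97.5% of the area to its left (about 1.96).
z.975 <- qnorm(0.975)

# With a mean and sd supplied, the result is in the units of the variable.
q.scaled <- qnorm(0.975, mean = 100, sd = 15)   # equals 100 + 15 * z.975

print(c(x.back, z.975, q.scaled))
```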

Random Generation Function

The random generation function is something provided by R to generate random data for a given distribution. The function name begins with “r”: rnorm, rbinom, runif etc.

Here is an example of randomly-generated data for a standard normal distribution:

data <- rnorm(n=100, mean=0, sd=1)
hist(data)

plot of chunk unnamed-chunk-7

The histogram looks roughly normal; the larger the data set generated, the more normal it will look:

hist(rnorm(n=1000))

plot of chunk unnamed-chunk-8

hist(rnorm(n=10000))

plot of chunk unnamed-chunk-8

hist(rnorm(n=1000000))

plot of chunk unnamed-chunk-8

A Note about Quantiles

Quantiles can be expressed in various terms (if I understand correctly). A common form is percentiles; so a quantile of 0.99 means a value is at the 99th percentile. The quantile functions take such a probability and return a value on the scale of the variable, which for the standard normal distribution means standard deviations. So a QQ (quantile-quantile) plot might show an x scale of -2.5 to 2.5. This means that a given value is expected to be somewhere between 2.5 sd’s below, to 2.5 sd’s above, the mean.

Normal (Gaussian) Distribution

The density function of the normal distribution accepts a vector of values for x, and optionally a mean (defaults to 0), a standard deviation (defaults to 1), and log (defaults to FALSE). The lower.tail argument (defaults to TRUE) belongs to pnorm and qnorm, not dnorm. A very simple example:

dnorm(0.9)

## [1] 0.2661

This shows that at a value of 0.9, in a normal distribution with mean=0 and sd=1, the curve has a height of about 0.27. This is a density, not a probability: for a continuous distribution the probability of any single exact value is zero, and probabilities come from areas under the curve.

Now here are several values from dnorm, for a sequence of values from -3 to 3. The familiar bell curve becomes visible.

x <- seq(-3, 3, 0.1)
plot(x, dnorm(x), type="l", xlab="S.D.s from Mean", ylab="Prob.")

plot of chunk unnamed-chunk-10

This shows that values more than 2 standard deviations from the mean, in either direction, are unlikely: the curve is low there, so little area lies under it. The curve peaks at the mean with a height of about 0.4, which is a density rather than a 40% probability.

The examples above are for a standard normal population, but different means and standard deviations can be supplied. For example, suppose that the mean score for the SAT is 1500 and the sd is 300.

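# named sat.mean and sat.sd so they do not mask R's mean() and sd() functions
sat.mean <- 1500
sat.sd <- 300
data <- seq(sat.mean - (4*sat.sd), sat.mean + (4*sat.sd), 20)
plot(data, dnorm(data, mean=sat.mean, sd=sat.sd), type="l")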

plot of chunk unnamed-chunk-11

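Note the change in the scale of the y axis. The height of the curve is a density, not a probability. Stretching the x axis by a factor of 300 (the standard deviation) squeezes the heights by the same factor, because the total area under the curve must stay at 1. So the peak, which was about 0.4 for the standard normal, is now about 0.4/300, or roughly 0.0013.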
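The y-axis scaling can be checked directly. A sketch using a mean of 1500 and sd of 300 for the SAT; the x values are arbitrary points I picked.

```r
# Illustrative values for the SAT example.
m <- 1500
s <- 300
x <- c(900, 1200, 1500, 1800)

# Density on the original scale equals the standard normal density of the
# Z score, divided by the standard deviation.
lhs <- dnorm(x, mean = m, sd = s)
rhs <- dnorm((x - m) / s) / s

print(cbind(x, lhs, rhs))
```

The last two columns should match row by row.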

Geometric Distribution

When to use: You need to know the expected number of trials until the first success.

Density function

Here is the probability density for the geometric distribution over the interval 0 to 10 when probability is 0.5.

barplot(dgeom(x=seq(0,10), prob=0.5))

plot of chunk unnamed-chunk-12

Here is the probability density for the geometric distribution over the interval 0 to 10 when probability is 0.1.

barplot(dgeom(x=seq(0,10), prob=0.1))

plot of chunk unnamed-chunk-13

Cumulative distribution function

The cumulative distribution function gives a cumulative probability for all values in a range from 0 to a quantile q. For example, here is the probability of the first heads coming in 1, 2, 3, 4, 5, or 6 flips.

pgeom(q=5, prob=0.5)

## [1] 0.9844

The plot of the cumulative distribution function rises steeply and then levels off toward 1. It looks logarithmic, but the curve is actually 1 - (1 - prob)^(q + 1).

barplot(pgeom(q=seq(0,10), prob=0.5))

plot of chunk unnamed-chunk-15

Quantile function

The quantile function is the inverse of the cumulative distribution function.

qgeom(p=0.984375, prob=0.5)

## [1] 5

Random data samples

The rgeom function is not entirely intuitive. It takes two parameters, n (the number of values to be created) and prob, the probability of a success. Each value returned is the number of failures that occurred before the first success in one simulated run of trials.

For example, if prob is 1/6, this is equivalent to rolling a 6 (or any other specified value) on a single die. If a value returned is 3, it means that the first three die rolls were not a 6; so it took 4 tries to get a six.

If prob is close to 0, then the values returned will tend to be higher (and probably no zeroes). If prob is close to 1, many of the values returned will be 0.

set.seed(1)
rgeom(10, 0.1)

##  [1]  6  3 23  3 24 13 15  2 20  3

rgeom(10, 0.8)

##  [1] 0 0 0 0 1 0 0 0 0 0

If the probability of success is 0.35, then to calculate the probability that the first success will occur within the first 4 trials:

pgeom(q=3, prob=0.35)

## [1] 0.8215

Note that q is one less than the number of trials being tested. This says, return the probability that there will be q or fewer failures before the first success.
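The same number falls out of a closed form, which may make the off-by-one in q easier to remember. A sketch; n.fail is my own name for the q argument.

```r
p <- 0.35     # probability of success on each trial
n.fail <- 3   # failures allowed before the first success

# "First success within n.fail + 1 trials" is the complement of
# "the first n.fail + 1 trials all fail".
closed.form <- 1 - (1 - p)^(n.fail + 1)

print(c(pgeom(q = n.fail, prob = p), closed.form))
```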

Binomial Distribution

When to use: Given a fixed number of trials, you need to know the probability of a certain number of successes, or the number of anticipated successes for a certain probability.

The probability of one particular sequence of k successes and n - k failures in n trials is:

\[p^k(1-p)^{n-k}\]

There can be multiple ways of those k successes occurring. For example, if k is 1, then it can occur in any one of the n trials. So the total probability of k successes in n trials is:

\[{n \choose k} p^k(1-p)^{n-k}\]

Where:

\[{n \choose k} = \frac{n!}{k!(n-k)!}\]

The mean of the number of observed successes is: \[\mu = np\]

The variance of the number of observed successes is: \[np(1-p)\]

And so the standard deviation of the number of observed successes is: \[\sqrt{np(1-p)}\]
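These formulas can be checked against dbinom. A sketch with illustrative values of n, k and p (chosen by me):

```r
# Illustrative values.
n <- 4
k <- 1
p <- 0.35

# Formula written out by hand, compared with dbinom().
by.hand <- choose(n, k) * p^k * (1 - p)^(n - k)
built.in <- dbinom(x = k, size = n, prob = p)

# Mean and standard deviation of the number of successes.
mu <- n * p
sigma <- sqrt(n * p * (1 - p))

print(c(by.hand, built.in, mu, sigma))
```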

pbinom shows the cumulative probability of having q or fewer successes in size trials. For example:

pbinom(q=1, size=4, prob=0.35)

## [1] 0.563

This says that, in four trials when the probability of success is 0.35, there is about a 0.563 probability of having zero or one successes. If you want to know the probability of having exactly one success:

pbinom(q=1, size=4, prob=0.35) - pbinom(q=0, size=4, prob=0.35)

## [1] 0.3845

This is the probability of at least one success in four trials, which is the same as the first success arriving within four trials:

pbinom(q=4, size=4, prob=0.35) - pbinom(q=0, size=4, prob=0.35)

## [1] 0.8215

Note that this can be done with pgeom, as shown above. The following, again, shows the probability that there will be q or fewer failures before the first success:

pgeom(q=3, prob=0.35)

## [1] 0.8215

This is also equivalent to:

0.65^0*0.35 + 0.65^1*0.35 + 0.65^2 * 0.35 + 0.65^3 * 0.35

## [1] 0.8215

The latter is the probability of a success on the first trial, plus the probability of the first success coming on the second, third and fourth trials.

Normal approximation to the binomial

Need to fill in more but here are the basics.

Suppose that the probability of someone smoking is 0.2. A sample of 400 people is drawn. What is the probability that no more than 59 are smokers?

To do this with a binomial distribution, we must calculate the probability of 0, 1, 2…59 smokers individually, and sum them. Or we can use R.

pbinom(59, 400, 0.2)

## [1] 0.004112

Summing up all of the individual probabilities is tedious. Or might I say, was tedious, back in the dark ages before computers could do things like this instantly. But the canon of statistical teaching still includes using the normal distribution to approximate the binomial distribution. We can use a mean of:

\[\mu = np\]

And a standard deviation of:

\[\sqrt{np(1-p)}\]

And so:

n = 400
p = 0.2
mu = n * p
sd = sqrt(n * p * (1-p))
pnorm(59, mu, sd)

## [1] 0.004332

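The values are similar but not identical. The binomial distribution is discrete and, with p = 0.2, slightly right-skewed, while the normal curve is continuous and symmetric, so the two disagree a little, especially out in the tails. Need to illustrate this.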
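A sketch that puts the two answers side by side, along with the skewness of the binomial, which I am using here as a rough indicator of how far from symmetric it is. The variable names are mine.

```r
n <- 400
p <- 0.2
mu <- n * p
sigma <- sqrt(n * p * (1 - p))

exact <- pbinom(59, size = n, prob = p)            # sum of 60 discrete probabilities
normal.approx <- pnorm(59, mean = mu, sd = sigma)  # area under a smooth curve

# Skewness of the binomial; it is 0 only when p = 0.5.
skew <- (1 - 2 * p) / sigma

print(c(exact = exact, normal.approx = normal.approx, skew = skew))
```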

Negative Binomial Distribution

When to use: Given a fixed number of successes, you need to know the anticipated number of trials, given a certain probability, or the probability of a certain number of trials.

Very quickly. The probability that the fourth success arrives on the sixth trial, when probability of success is 0.8:

n=6
k=4
dnbinom(x=n-k, size=k, prob=0.8)

## [1] 0.1638

Note that the binomial distribution is based on a fixed number of trials; the negative binomial is based on a fixed number of successes.

The following shows the probability of getting the fourth success on trial 4, 5, or 6:

pnbinom(q=n-k, size=k, prob=0.8)

## [1] 0.9011

sum(dnbinom(x=0:2, size=k, prob=0.8))

## [1] 0.9011
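To see what dnbinom is computing, here is the formula written out by hand. A sketch using the same n, k and probability as above:

```r
n <- 6   # trial on which the last success arrives
k <- 4   # number of successes required
p <- 0.8

# The k-th success lands on trial n: the first n - 1 trials contain
# exactly k - 1 successes, and trial n is itself a success.
by.hand <- choose(n - 1, k - 1) * p^k * (1 - p)^(n - k)
built.in <- dnbinom(x = n - k, size = k, prob = p)

print(c(by.hand, built.in))
```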

Poisson Distribution

\[P(\text{observe }k\text{ rare events})= \frac{\lambda^ke^{-\lambda}}{k!}\]

Consider that a rare event occurs an average of 4.4 times per given period. What is the probability of it occurring 5 times?

lambda=4.4
k=5
(lambda^k * exp(-lambda))/factorial(k)

## [1] 0.1687

dpois(5, 4.4)

## [1] 0.1687

Probability of occurring 5 or fewer times:

sum(dpois(0:5, 4.4))

## [1] 0.7199

ppois(5, 4.4)

## [1] 0.7199

And more than 5:

1-sum(dpois(0:5, 4.4))

## [1] 0.2801

ppois(5, 4.4, lower.tail=FALSE)

## [1] 0.2801

Log-Normal Distribution

See plnorm

Chi-Square Distribution

This distribution is not covered in chapter 3 of the DASI text; it is introduced in chapter 6, in the context of comparing 3 or more values for a categorical variable.

I ran a simple experiment. Imagine a sample that has 40 positives and 60 negatives. The population has a proportion of 0.5 for each. Chi-square test shouldn’t be done (I think) on only two values/bins, but let’s try it.

tmp <- chisq.test(x=c(40, 60), p=c(0.5, 0.5))
tmp$p.value

## [1] 0.0455

The p-value is a bit below 0.05 so we conclude (in this perhaps invalid test) that the sample is biased and does not adequately represent the population.

Now do a hypothesis test on the proportion, with the null hypothesis that the true proportion is 0.5.

n <- 100
point.est <- 0.4
null.prop <- 0.5
se <- sqrt((null.prop*(1-null.prop))/n)
z.score <- (point.est - null.prop)/se
pnorm(-abs(z.score)) * 2 # two-sided test; abs() keeps this correct if the z score is positive

## [1] 0.0455

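The p-values are identical. This is no coincidence: with only two bins the chi-square test has one degree of freedom, and its statistic is exactly the square of the Z score (here 4 and -2). So it is OK to use chi-square for fewer than three bins.

Also note that chisq.test does not take a confidence level at all. It reports a p-value, which you can compare against whatever significance level you choose. You could also take the $X^2$ value and degrees of freedom and compute the p-value yourself.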
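A sketch of getting the p-value by hand from the statistic and the degrees of freedom, using pchisq, for the 40/60 sample. The variable names are mine.

```r
observed <- c(40, 60)
expected <- sum(observed) * c(0.5, 0.5)

# Chi-square statistic computed by hand; two bins give 1 degree of freedom.
x2 <- sum((observed - expected)^2 / expected)
p.chisq <- pchisq(x2, df = 1, lower.tail = FALSE)

# Z statistic for the same data; with two bins, z^2 equals the X^2 statistic.
z <- (0.4 - 0.5) / sqrt(0.5 * 0.5 / 100)
p.z <- 2 * pnorm(-abs(z))

print(c(x2 = x2, z.squared = z^2, p.chisq = p.chisq, p.z = p.z))
```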

Exercises

3.6.1 Normal distribution

3.1 Area under the curve, I.

What percent of a standard normal distribution $N(\mu=0, \sigma=1)$ is found in each region?

Z < -1.35

pnorm(q=-1.35, mean=0, sd=1)

## [1] 0.08851

Z > 1.48

pnorm(q=1.48, mean=0, sd=1, lower.tail=FALSE)

## [1] 0.06944

-0.4 < Z < 1.5

pnorm(q=1.5, mean=0, sd=1) - pnorm(q=-0.4, mean=0, sd=1)

## [1] 0.5886

|Z| > 2

pnorm(q=-2, mean=0, sd=1) + 
  pnorm(q=2, mean=0, sd=1, lower.tail=FALSE)

## [1] 0.0455

3.2 Area under the curve II.

What percent of a standard normal distribution $N(\mu=0, \sigma=1)$ is found in each region?

Z > -1.13

pnorm(q=-1.13, mean=0, sd=1, lower.tail=FALSE)

## [1] 0.8708

Z < 0.18

pnorm(q=0.18, mean=0, sd=1)

## [1] 0.5714

Z > 8

pnorm(q=8, mean=0, sd=1, lower.tail=FALSE)

## [1] 6.221e-16

|Z| < 0.5

1 - (pnorm(-0.5, mean=0, sd=1) +
  pnorm(0.5, mean=0, sd=1, lower.tail=FALSE))

## [1] 0.3829

#or
pnorm(0.5, mean=0, sd=1) - pnorm(-0.5, mean=0, sd=1)

## [1] 0.3829

3.3 Scores on the GRE, Part I

A college senior who took the GRE scored 620 on the verbal reasoning section and 670 on the quantitative reasoning section. The mean score for the verbal reasoning section was 462 with a standard deviation of 119, and the mean score for the quantitative reasoning section was 584 with a standard deviation of 151. Suppose that both distributions are nearly normal.

Write down the shorthand for both these distributions

Verbal: $N(\mu=462, \sigma=119)$

Quantitative: $N(\mu=584, \sigma=151)$

What is her Z score on the verbal reasoning section?

(620-462)/119

## [1] 1.328

On the quantitative reasoning section?

(670-584)/151

## [1] 0.5695

Draw a standard normal distribution curve and mark these two Z scores.

x <- seq(-3, 3, length=100)
y <- dnorm(x)
plot(x, y, type="l", lty=1)
abline(v=(620-462)/119, col="blue")
abline(v=(670-584)/151, col="red")

plot of chunk unnamed-chunk-43

What do these Z scores tell you?

She scored 1.328 standard deviations above the mean in verbal and 0.5695 standard deviations above the mean in quantitative.

Relative to others, which did she do better on?

Verbal

Find her percentile scores for the two exams.

#verbal:
pnorm(q=620, mean=462, sd=119)

## [1] 0.9079

#or
pnorm(q=(620-462)/119, mean=0, sd=1)

## [1] 0.9079

#quantitative:
pnorm(q=670, mean=584, sd=151)

## [1] 0.7155

#or
pnorm(q=(670-584)/151, mean=0, sd=1)

## [1] 0.7155

What percent of the test takers did better than she in the verbal?

100 - (pnorm(q=620, mean=462, sd=119)*100)

## [1] 9.213

Quantitative?

100 - (pnorm(q=670, mean=584, sd=151)*100)

## [1] 28.45

Explain why simply comparing her raw scores from the two sections would lead to the incorrect conclusion that she did better on the quantitative reasoning section.

It depends on what you mean by an “incorrect conclusion”. You might very well, correctly, conclude that she did in fact perform better on the quantitative reasoning section, if your criterion is percentage of all points obtained. Let me put it this way. Do you want a brain surgeon who performed better than 99% of all students on a literary criticism exam but only 1% of all students on a brain surgery exam? I could make this even more extreme. Suppose that the mean score on the literary criticism exam is 3 (out of 100), and the mean score on the brain surgery exam is 90 (out of 100). We see that as a rule, brain surgeons know nothing about literary criticism, but they know a lot about brain surgery. Our “winner” can talk all day about Derrida but doesn’t know what a medulla oblongata is.

However, if our criterion is to compare a student’s score to the performance of the population of students (and we assume that this population is normal), then we should look for the higher Z score, which tells us how many standard deviations above the mean a student scored.

If the distributions of the scores on these exams are not nearly normal, would your answers to parts b-f change? Explain your reasoning.

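Z scores could still be calculated, but they could no longer be converted to percentiles with pnorm, and comparing them across the two exams would mean much less. So yes, my answers to the percentile questions would generally be “dunno”.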

3.4 Triathlon times, part I.

In triathlons, it is common for racers to be placed into age and gender groups. Friends Leo and Mary both completed the Hermosa Beach triathlon, where Leo competed in the Men Ages 30-34 group while Mary competed in the Women Ages 25-29 group. Leo completed the race in 1:22:28 (4948 seconds) while Mary completed the race in 1:31:53 (5513 seconds). Obviously, Leo finished faster, but they are curious how they did within their respective groups. Here is the information on the performance of their groups:

Men Ages 30-34 mean=4313, sd=583

Women Ages 25-29 mean=5261, sd=807

Distributions for both are approximately normal

Write down the shorthand for these distributions.

Men Ages 30-34: $N(\mu=4313, \sigma=583)$

Women Ages 25-29: $N(\mu=5261, \sigma=807)$

What are the Z scores for Leo’s and Mary’s finishing times?

leo.z = (4948-4313)/583
mary.z = (5513-5261)/807
leo.z

## [1] 1.089

mary.z

## [1] 0.3123

Did Leo or Mary rank better in their respective groups?

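# In a race a lower time is better, so the lower Z score ranks better.
if (leo.z < mary.z) {
  print("Leo")
} else {
  print("Mary")
}

## [1] "Mary"

Mary’s time was only 0.31 standard deviations above (slower than) her group’s mean, while Leo’s was 1.09 standard deviations above his.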

What percent of the triathletes did Leo finish faster than in his group?

100 - (pnorm(q=leo.z)*100)

## [1] 13.8

What percent of the triathletes did Mary finish faster than in her group?

100 - (pnorm(q=mary.z)*100)

## [1] 37.74

If you can’t assume anything about anything, does this mean that you would assume anything about anything.

DUH

3.5 GRE scores part II

In exercise 3.3 we saw two distributions for GRE scores, $N(\mu=462, \sigma=119)$ for verbal and $N(\mu=584, \sigma=151)$ for quantitative.

The score of a student who scored in the 80th percentile on quantitative.

#quantitative is N(584, 151)
qnorm(p=0.8, mean=584, sd=151)

## [1] 711.1

The score of a student who scored worse than 70% of the test takers in the verbal.

#worse than 70% of test takers means the 30th percentile; verbal is N(462, 119)
qnorm(p=0.3, mean=462, sd=119)

## [1] 399.6

3.6 Triathlon times, part II.

In exercise 3.4, we saw two distributions for triathlon times: $N(\mu=4313, \sigma=583)$ for Men ages 30-34 and $N(\mu=5261, \sigma=807)$ for Women ages 25-29.

The cutoff time for the fastest 5% of athletes in the men’s group.

qnorm(0.05, mean=4313, sd=583)

## [1] 3354

The cutoff time for the slowest 10% in the women’s group.

qnorm(0.9, mean=5261, sd=807)

## [1] 6295

3.7 Temperatures in LA, part I

The average daily high temperature in June in LA is 77F with a standard deviation of 5F. Suppose the temperatures closely follow a normal distribution.

What is the probability of observing an 83F temperature or higher in LA during a randomly chosen day in June?

pnorm(83, mean=77, sd=5, lower.tail=FALSE)

## [1] 0.1151

How cold are the coldest 10% of days during June in LA?

qnorm(0.1, mean=77, sd=5)

## [1] 70.59

3.8 Portfolio returns

The Capital Asset Pricing Model is a financial model that assumes returns on a portfolio are normally distributed. Suppose a portfolio has an average annual return of 14.7% (i.e. an average gain of 14.7%) with a standard deviation of 33%. A return of 0% means the value of the portfolio doesn’t change, a negative return means that the portfolio loses money, and a positive return means that the portfolio gains money.

What percent of years does this portfolio lose money, i.e., have a return rate less than zero?

pnorm(q=0, mean=0.147, sd=0.33)

## [1] 0.328

What is the cutoff for the highest 15% of annual returns for this portfolio?

qnorm(0.85, mean=0.147, sd=0.33)

## [1] 0.489

3.9 Temperatures in LA part II

Exercise 3.7 states that the average daily high temperature in LA in June is 77F with a standard deviation of 5F, and it can be assumed that temperatures follow a normal distribution. We use the following to convert F to C:

\[C = (F-32) \cdot \frac{5}{9}\]

Write the probability model for the distribution of temperature in C in June in LA

\[N(\mu=25, \sigma=2.78)\]

What is the probability of observing 28C or higher in LA in June?

pnorm(q=28, mean=25, sd=2.78, lower.tail=FALSE)

## [1] 0.1403

Did you get the same or different answers in part b of this question and part a of exercise 3.7?

Different. 28C “roughly” corresponds to 83F but is not an exact match. 83F is closer to 28.333C. If I use the latter (and the unrounded standard deviation of 2.778), I get the same result.
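A sketch of the conversion done without rounding, to confirm that the two scales agree. f.to.c is a helper name I made up.

```r
f.to.c <- function(f) (f - 32) * 5 / 9

mean.c <- f.to.c(77)   # 25
sd.c <- 5 * 5 / 9      # a spread is only rescaled, not shifted by 32

# 83F converted exactly, without rounding to 28C.
p.celsius <- pnorm(f.to.c(83), mean = mean.c, sd = sd.c, lower.tail = FALSE)
p.fahrenheit <- pnorm(83, mean = 77, sd = 5, lower.tail = FALSE)

print(c(p.celsius, p.fahrenheit))
```

Both values should come out at about 0.1151, because a change of units does not change the Z score.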

3.10 Heights of 10 year olds

Heights of 10 year olds, regardless of gender, closely follow a normal distribution with mean 55 inches and standard deviation 6 inches.

What is the probability that a randomly chosen 10 year old is shorter than 48 inches?

pnorm(q=48, mean=55, sd=6)

## [1] 0.1217

What is the probability that a randomly chosen 10 year old is between 60 and 65 inches?

pnorm(q=65, mean=55, sd=6) - pnorm(60, mean=55, sd=6)

## [1] 0.1545

If the tallest 10% of the class is considered “very tall”, what is the height cutoff for “very tall”?

qnorm(p=0.9, mean=55, sd=6)

## [1] 62.69

The height requirement for Batman the Ride at Six Flags is 54 inches. What percent of 10 year olds cannot go on this ride?

pnorm(q=54, mean=55, sd=6)

## [1] 0.4338

3.11 Auto insurance premiums

Suppose a newspaper article states that the distribution of auto insurance premiums for residents of California is approximately normal with a mean of $1,650. The article also states that 25% of California residents pay more than $1,800.

What is the Z score that corresponds to the top 25% (or the 75th percentile) of the standard normal distribution?

auto.z <- qnorm(p=0.75, mean=0, sd=1)
auto.z

## [1] 0.6745

What is the mean insurance cost?

1650

What is the cutoff for the 75th percentile?

auto.sd = (1800-1650)/auto.z
qnorm(p=0.75, mean=1650, sd=auto.sd)

## [1] 1800

The above calculation is of course unnecessary; all that needs to be done is to restate what was given, 1800.

Identify the standard deviation of insurance premiums in LA.

auto.sd

## [1] 222.4

3.12 Speeding on the I-5, part I

The distribution of passenger vehicle speeds traveling on the Interstate 5 Freeway (I-5) in California is nearly normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 miles/hour.

What percentage of passenger vehicles travel slower than 80 mph?

pnorm(q=80, mean=72.6, sd=4.78)

## [1] 0.9392

What percentage of passenger vehicles travel between 60 and 80 mph?

pnorm(q=80, mean=72.6, sd=4.78) - pnorm(q=60, mean=72.6, sd=4.78)

## [1] 0.935

How fast do the fastest 5% of passenger vehicles travel?

qnorm(p=0.95, mean=72.6, sd=4.78)

## [1] 80.46

The speed limit on this stretch of the I-5 is 70 mph. Approximately what percentage of the passenger vehicles travel above the speed limit?

pnorm(q=70, mean=72.6, sd=4.78, lower.tail=FALSE)

## [1] 0.7068

3.13 Overweight baggage

Suppose weights of the checked baggage of airline passengers follow a nearly normal distribution with mean 45 pounds and standard deviation 3.2 pounds. Most airlines charge a fee for baggage that weigh in excess of 50 pounds. Determine what percent of airline passengers incur this fee.

pnorm(q=50, mean=45, sd=3.2, lower.tail=FALSE)

## [1] 0.05909

3.14 Find the SD

MENSA is an organization whose members have IQs in the top 2% of the population. IQs are normally distributed with mean 100, and the minimum IQ score required for admission to MENSA is 132.

mensa.z <- qnorm(0.98, mean=0, sd=1)
mensa.sd <- (132-100)/mensa.z
mensa.sd

## [1] 15.58

Cholesterol levels for women aged 20 to 34 follow an approximately normal distribution with mean 185 milligrams per deciliter (mg/dl). Women with cholesterol above 220 mg/dl are considered to have high cholesterol, and about 18.5% of women fall into this category.

chol.z <- qnorm(p=(1-0.185), mean=0, sd=1)
chol.sd <- (220 - 185) / chol.z
chol.sd

## [1] 39.04

3.15 Buying books on ebay

The textbook you need to buy for your chemistry class is expensive at the college bookstore, so you consider buying it on Ebay instead. A look at past auctions suggests that the prices of that chemistry textbook have an approximately normal distribution with mean $89 and standard deviation $15.

What is the probability that a randomly selected auction for this book closes at more than $100?

pnorm(q=100, mean=89, sd=15, lower.tail=FALSE)

## [1] 0.2317

Ebay allows you to set your maximum bid price so that if someone outbids you on an auction you can automatically outbid them, up to the maximum bid price you set. If you are only bidding on one auction, what are the advantages and disadvantages of setting a bid price too high or too low? What if you are bidding on multiple auctions?

Too high: You ensure you get the item, but you might overpay. If you are doing this on many items, you will presumably not overpay significantly all that often.

Too low: You probably won’t get the item, but if you do, it’s for a very low price. If you do this on many items, you have a high probability of getting some good deals, but also you will almost certainly not get many items.

If you watched 10 auctions, roughly what percentile might you use for a maximum bid cutoff to be somewhat sure that you will win one of these ten auctions? Is it possible to find a cutoff point that will ensure that you win an auction?

Choosing a percentile of 90% is probably fairly safe.

No, you cannot “ensure”, only increase probability. If you set a maximum of $10,000 you will almost certainly win, but if you are crazy enough to bid that much, somebody else could be crazy enough to bid even more.

If you are willing to track up to ten auctions closely, about what price might you use as your maximum bid price if you want to be somewhat sure that you will buy one of these ten books?

qnorm(p=0.9, mean=89, sd=15)

## [1] 108.2

3.16 SAT scores

SAT scores (out of 2400) are distributed normally with a mean of 1500 and a standard deviation of 300. Suppose a school council awards a certificate of excellence to all students who score at least 1900 on the SAT, and suppose we pick one of the recognized students at random. What is the probability this student’s score will be at least 2100?

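#conditional probability: P(score >= 2100) / P(score >= 1900)
pnorm(q=2100, mean=1500, sd=300, lower.tail=FALSE) / pnorm(q=1900, mean=1500, sd=300, lower.tail=FALSE)

## [1] 0.2494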

3.17 Scores on stats final, part I

Below are the final exam scores of 20 Introductory Statistics students.

stats.scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

The mean score is 77.7 with a standard deviation of 8.44 points. Use this info to determine whether scores approximately follow the 68-95-99.7% rule.

#68% of 20 tests is about 14 tests, so take the middle 14 (the 4th to the 17th)
min.68 <- stats.scores[4]
max.68 <- stats.scores[17]

min.68

## [1] 71

#the middle 68% of a normal distribution runs from the 16th to the 84th percentile
qnorm(p=0.16, mean=77.7, sd=8.44)

## [1] 69.31

max.68

## [1] 83

qnorm(p=0.84, mean=77.7, sd=8.44)

## [1] 86.09

#closest approximation to 95% is first and last data points
min.95 <- stats.scores[1]
max.95 <- stats.scores[20]

min.95

## [1] 57

#the middle 95% runs from the 2.5th to the 97.5th percentile
qnorm(p=0.025, mean=77.7, sd=8.44)

## [1] 61.16

max.95

## [1] 94

qnorm(p=0.975, mean=77.7, sd=8.44)

## [1] 94.24

There are insufficient data points to test the 99.7% part of the rule. But at 68% and 95%, the data conform fairly well. The lowest score (57) sits below the 2.5% cutoff of 61.16, but we can probably attribute this to the small sample size.
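Another way to check the rule, which I find more direct, is simply to count how many scores fall within 1, 2 and 3 standard deviations of the mean. A sketch; within.k.sd is my own helper, and it uses the sample mean and sd computed from the data rather than the rounded values quoted above.

```r
stats.scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
                  79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Proportion of observations within k standard deviations of the mean.
within.k.sd <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

props <- sapply(1:3, function(k) within.k.sd(stats.scores, k))
print(props)
```

This gives 70%, 95% and 100%, which lines up well with 68-95-99.7 for a sample of only 20.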

3.18 Heights of female college students part I

Below are heights of 25 college female students.

heights <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61, 62, 62, 63, 63, 63, 64, 65, 65, 67, 67, 69, 73)

The mean height is 61.52 inches with a standard deviation of 4.58 inches. Use this info to determine whether the heights approximately follow the 68-95-99.7% rule.

#68% of 25 is 17, so our min and max points are 5 and 21
min.68 <- heights[5]
max.68 <- heights[21]
min.68

## [1] 57

#the middle 68% of a normal distribution runs from the 16th to the 84th percentile
qnorm(p=0.16, mean=61.52, sd=4.58)

## [1] 56.97

max.68

## [1] 65

qnorm(p=0.84, mean=61.52, sd=4.58)

## [1] 66.07

#closest approximation to 95% is first and last data points
min.95 <- heights[1]
max.95 <- heights[25]

min.95

## [1] 54

#the middle 95% runs from the 2.5th to the 97.5th percentile
qnorm(0.025, mean=61.52, sd=4.58)

## [1] 52.54

max.95

## [1] 73

qnorm(0.975, mean=61.52, sd=4.58)

## [1] 70.5

No reasonable approximation for 99.7%, but 68 and 95 are reasonably close. The tallest student (73) is a little beyond the 97.5% cutoff.

3.6.2 Evaluating the normal approximation

3.19 Scores on stats final, part II

Exercise 3.17 lists the final exam scores of 20 Introductory Statistics students. Do these data appear to follow a normal distribution?

hist(stats.scores)

plot of chunk unnamed-chunk-81

qqnorm(stats.scores)

plot of chunk unnamed-chunk-81

Yes, especially considering the small sample size, the data looks reasonably like a normal distribution.

3.20 Heights of female college students, part II

Exercise 3.18 lists the heights of 25 female college students. Do these data appear to follow a normal distribution?

hist(heights)

plot of chunk unnamed-chunk-82

qqnorm(heights)

plot of chunk unnamed-chunk-82

This data set appears to be right-skewed and so may not follow a normal distribution. More data would clarify.

3.6.3 Geometric distribution

3.21 Is it Bernoulli?

Determine if each trial can be considered a Bernoulli trial for the following situations.

Cards dealt in a hand of poker

No. A poker hand is never a “success-fail” trial on its own.

Outcome of each roll of a die.

Again no. There are six possible outcomes. If the trial is “a 6 or not a 6” then it would be a Bernoulli trial.

3.22 With and without replacement

In the following situations assume that half the population is male and the other half is female.

Suppose you’re sampling from a room with 10 people. What is the probability of sampling two females in a row when sampling with replacement? What is the probability when sampling without replacement?

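With replacement, 0.5 * 0.5 = 0.25. Without replacement, the second draw comes from the 9 people left, 4 of whom are female:

(5/10) * (4/9)

## [1] 0.2222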

Now suppose you’re sampling from a stadium with 10,000 people. What is the probability of sampling two females in a row when sampling with replacement? What is the probability when sampling without replacement?

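With replacement, again 0.5 * 0.5 = 0.25. Without replacement:

(5000/10000) * (4999/9999)

## [1] 0.25

The unrounded value is 0.249975, so the two answers are now almost identical.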

We often treat individuals who are sampled from a large population as independent. Using your findings from parts (a) and (b), explain whether or not this assumption is reasonable.

Yes it is reasonable. In a large population, when presumably the sample size is a tiny percentage of the population, it makes very little difference whether the sampling is done with or without replacement.
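A sketch of how the gap closes as the population grows, taking the probability that two people drawn in a row are both female. two.females is a name I made up, and the population of 100 is an extra size added just for illustration.

```r
# Probability that two people drawn in a row are both female,
# in a population of size n that is half female.
two.females <- function(n, replace) {
  f <- n / 2
  if (replace) (f / n)^2 else (f / n) * ((f - 1) / (n - 1))
}

sizes <- c(10, 100, 10000)
with.repl <- sapply(sizes, two.females, replace = TRUE)
without.repl <- sapply(sizes, two.females, replace = FALSE)

print(data.frame(sizes, with.repl, without.repl))
```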

3.23 Married women

The 2010 American Community Survey estimates that 47.1% of women ages 15 years and over are married.

We randomly select three women between these ages. What is the probability that the third woman to be selected is the only one who is married?

#two unmarried women followed by one married woman
0.529^2 * 0.471

## [1] 0.1318

#or, x=number of failures before the first success
dgeom(x=2, prob=0.471)

## [1] 0.1318

What is the probability that all three randomly selected women are married?

dbinom(3, 3, 0.471)

## [1] 0.1045

On average, how many women would you expect to sample before selecting a married woman? What is the standard deviation?

p = 0.471
mu <- 1 / p
mu

## [1] 2.123

sd = sqrt((1-p)/p^2)
sd

## [1] 1.544

If the population of married women was actually 30%, how many women would you expect to sample before selecting a married woman? What is the standard deviation?

p <- 0.3
1 / p

## [1] 3.333

sqrt((1-p)/p^2)

## [1] 2.789

Based on your answers to parts c and d, how does decreasing the probability of an event affect the mean and standard deviation of the wait time until success?

The mean goes up; it takes more trials to get a success. The standard deviation also goes up: as p (which is less than 1) goes down, the numerator 1-p in the formula for sd goes up while the denominator p^2 goes down.
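The pattern can be checked for several values of p at once. This is only a sketch; geom_wait is a hypothetical helper name of my own, and it uses the same formulas as above (mean 1/p and sd sqrt((1-p)/p^2) for the number of trials up to and including the first success).

```r
# Hypothetical helper (name is an assumption): mean and sd of the number
# of trials up to and including the first success.
geom_wait <- function(p) {
  c(p = p, mean = 1 / p, sd = sqrt((1 - p) / p^2))
}

# One row per probability; both mean and sd grow as p shrinks
waits <- t(sapply(c(0.471, 0.3, 0.1), geom_wait))
waits
```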

3.24 Defective rate

A machine that produces a special type of transistor (a component of computers) has a 2% defective rate. The production is considered a random process where each transistor is independent of the others.

What is the probability that the 10th transistor produced is the first with a defect?

#x = # trials - # successes, size=# successes
dnbinom(x=9, size=1, prob=0.02)

## [1] 0.01667

#or, x=number of failures before success
dgeom(x=9, prob=0.02)

## [1] 0.01667

What is the probability that the machine produces no defective transistors in a batch of 100?

0.98^100

## [1] 0.1326

#or
dgeom(x=100, prob=0.02)/0.02 #the /0.02 undoes the last successful trial

## [1] 0.1326

On average, how many transistors would you expect to be produced before the first with a defect? What is the standard deviation?

p = 0.02
mu = 1/p
mu

## [1] 50

sd = sqrt((1-p)/p^2)
sd

## [1] 49.5

Another machine that also produces transistors has a 5% defective rate where each transistor is produced independent of the others. On average how many transistors would you expect to be produced with this machine before the first with a defect? What is the standard deviation?

p = 0.05
mu = 1 / p
mu

## [1] 20

sd = sqrt((1-p)/p^2)
sd

## [1] 19.49

Based on your answers to parts (c) and (d), how does increasing the probability of an event affect the mean and standard deviation of the wait time until success?

Both decrease: the more likely the event, the shorter and less variable the wait.

3.25 Eye color part I

A husband and wife both have brown eyes but carry genes that make it possible for their children to have brown eyes (probability 0.75), blue eyes (0.125), or green eyes (0.125).

What is the probability the first blue-eyed child they have is their third child? Assume that the eye colors of the children are independent of each other.

dgeom(x=2, prob=0.125)

## [1] 0.0957

On average, how many children would such a pair of parents have before having a blue-eyed child? What is the standard deviation of the number of children they would expect to have until the first blue-eyed child?

p <- 0.125
mu <- 1/p
mu

## [1] 8

sd <- sqrt((1-p)/p^2)
sd

## [1] 7.483

3.26 Speeding on the I-5 part II

Exercise 3.12 states that the distribution of speeds of cars traveling on the Interstate 5 Freeway (I-5) in California is nearly normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 miles/hour. The speed limit on this stretch of the I-5 is 70 miles/hour.

A highway patrol officer is hidden on the side of the freeway. What is the probability that 5 cars pass and none are speeding? Assume that the speeds of the cars are independent of each other.

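#p = probability that a single car is speeding (faster than 70)
p <- pnorm(70, mean=72.6, sd=4.78, lower.tail=FALSE)
#none of 5 independent cars is speeding
(1 - p)^5

## [1] 0.002168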

On average, how many cars would the highway patrol officer expect to watch until the first car that is speeding? What is the standard deviation of the number of cars he would expect to watch?

mu <- 1 / p
mu

## [1] 1.415

sd <- sqrt((1-p)/p^2)
sd

## [1] 0.7662

3.6.4 Binomial distribution

3.27 Underage drinking part I

The Substance Abuse and Mental Health Services Administration estimated that 70% of 18-20 year olds consumed alcoholic beverages in 2008.

Suppose a random sample of ten 18-20 year olds is taken. Is the use of the binomial distribution appropriate for calculating the probability that exactly six consumed alcoholic beverages? Explain.

The conditions: 1) Fixed number of trials, 2) trials are independent, 3) each trial is success/failure, 4) equal probability of success in each trial.

Yes this appears to meet all of the conditions.

Calculate the probability that exactly 6 out of 10 randomly sampled 18-20 year olds consumed an alcoholic drink.

dbinom(x=6, size=10, prob=0.7)

## [1] 0.2001

What is the probability that exactly four out of the ten 18-20 year olds have not consumed an alcoholic beverage?

dbinom(x=4, size=10, prob=0.3)

## [1] 0.2001

What is the probability that at most 2 out of 5 randomly sampled 18-20 year olds have consumed alcoholic beverages?

pbinom(q=2, size=5, prob=0.7)

## [1] 0.1631

What is the probability that at least 1 out of 5 randomly sampled 18-20 year olds have consumed alcoholic beverages?

#"at least 1" is P(X > 0), so q=0
pbinom(q=0, size=5, prob=0.7, lower.tail=FALSE)

## [1] 0.9976
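With lower.tail=FALSE, pbinom returns P(X > q), strictly greater than q, which makes "at least" questions easy to get wrong by one. The following sketch uses only base R and variable names of my own to check "at least k" three ways.

```r
n <- 5; p <- 0.7; k <- 1

# "At least k" is P(X >= k) = P(X > k - 1), so pass q = k - 1
at_least_k <- pbinom(q = k - 1, size = n, prob = p, lower.tail = FALSE)

# Same quantity by adding up the individual probabilities
by_sum <- sum(dbinom(x = k:n, size = n, prob = p))

# For k = 1 it is also the complement of "none"
by_complement <- 1 - dbinom(x = 0, size = n, prob = p)

all.equal(at_least_k, by_sum)
all.equal(at_least_k, by_complement)
```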

3.28 Chickenpox part I

The National Vaccine Information Center estimates that 90% of Americans have had chickenpox by the time they reach adulthood.

Suppose we take a random sample of 100 American adults. Is the use of the binomial distribution appropriate for calculating the probability that exactly 97 had chickenpox before they reached adulthood? Explain.

Yes, because 1) number of trials fixed, 2) trials are independent, 3) success/fail, 4) equal probability of success in each trial.

Calculate the probability that exactly 97 out of 100 randomly sampled American adults had chickenpox during childhood.

dbinom(x=97, size=100, prob=0.9)

## [1] 0.005892

What is the probability that exactly 3 out of a new sample of 100 American adults have not had chickenpox in their childhood?

dbinom(x=3, size=100, prob=0.1)

## [1] 0.005892

What is the probability that at least 1 out of 10 randomly sampled American adults have had chickenpox?

#"at least 1" is P(X > 0), so q=0
pbinom(q=0, size=10, prob=0.9, lower.tail=FALSE)

## [1] 1

What is the probability that at most 3 out of 10 randomly sampled American adults have not had chickenpox?

pbinom(q=3, size=10, prob=0.1)

## [1] 0.9872

3.29 Underage drinking part II

We learned in Exercise 3.27 that about 70% of 18-20 year olds consumed alcoholic beverages in 2008. We now consider a random sample of fifty 18-20 year olds.

How many people would you expect to have consumed alcoholic beverages? And with what standard deviation?

p <- 0.7
n <- 50
mu <- n*p
mu

## [1] 35

sd <- sqrt(n*p*(1-p))
sd

## [1] 3.24

Would you be surprised if there were 45 or more people who have consumed alcoholic beverages?

z <- abs(45-mu) / sd
#treat observations more than 2 standard deviations from the mean as unusual
if (z > 2) {
  print("Yes, the probability is low that 45 or more of 50 would have consumed alcohol:")
  pnorm(z, lower.tail=FALSE)
} else {
  print("No, the probability is reasonable that 45 or more out of 50 would have consumed alcohol:")
  pnorm(z, lower.tail=FALSE)
}

## [1] "Yes, the probability is low that 45 or more of 50 would have consumed alcohol:"

## [1] 0.001014

What is the probability that 45 or more people in this sample have consumed alcoholic beverages? How does this probability relate to your answer to part (b)?

pnorm(z, lower.tail=FALSE)

## [1] 0.001014

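The probability is about 0.1%, which agrees with part (b): 45 or more would be a surprising result. Note that pnorm gives a normal approximation to the binomial probability.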
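The calculation above uses pnorm, which approximates the binomial with a normal curve. As a sketch, the exact binomial tail can be put next to the approximation, with and without a 0.5 continuity correction. The variable names here are my own, and sigma is used instead of sd to avoid masking R's sd() function.

```r
n <- 50; p <- 0.7
mu <- n * p
sigma <- sqrt(n * p * (1 - p))

# Exact binomial: P(X >= 45) = P(X > 44)
exact <- pbinom(q = 44, size = n, prob = p, lower.tail = FALSE)

# Normal approximation, without and with a 0.5 continuity correction
approx_plain <- pnorm(45, mean = mu, sd = sigma, lower.tail = FALSE)
approx_cc    <- pnorm(44.5, mean = mu, sd = sigma, lower.tail = FALSE)

c(exact = exact, approx_plain = approx_plain, approx_cc = approx_cc)
```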

3.30 Chickenpox part II

We learned in Exercise 3.28 that about 90% of American adults had chickenpox before adulthood. We now consider a random sample of 120 American adults.

How many people in this sample would you expect to have had chickenpox in their childhood? And with what standard deviation?

n <- 120
p <- 0.9
mu <- n * p
mu

## [1] 108

sd <- sqrt(n*p*(1-p))
sd

## [1] 3.286

Would you be surprised if there were 105 people who have had chickenpox in their childhood?

z <- abs(105 - mu)/sd
#treat observations more than 2 standard deviations from the mean as unusual
if (z > 2) {
  print("yes")
} else {
  print("no")
}

## [1] "no"

What is the probability that 105 or fewer people in this sample have had chickenpox in their childhood?

#normal approximation to P(X <= 105)
pnorm(105, mean=mu, sd=sd)

## [1] 0.1807

3.31 University admissions

Suppose a university announced that it admitted 2,500 students for the following year’s freshman class. However, the university has dorm room spots for only 1,786 freshman students. If there is a 70% chance that an admitted student will decide to accept the offer and attend this university, what is the approximate probability that the university will not have enough dormitory room spots for the freshman class?

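#P(X > 1787). Strictly, the dorms overflow as soon as X > 1786, so q=1786 is the
#exact cutoff; the two values differ by dbinom(1787, 2500, 0.7).
pbinom(q=1787, size=2500, prob=0.7, lower.tail=FALSE)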

## [1] 0.05033

3.32 Survey response rate

Pew Research reported in 2012 that the typical response rate to their surveys is only 9%. If for a particular survey 15,000 households are contacted, what is the probability that at least 1,500 will agree to respond?

#P(X > 1500). "At least 1,500" also includes X = 1500, so q=1499 is the exact cutoff.
pbinom(q=1500, size=15000, prob=0.09, lower.tail=FALSE)

## [1] 1.173e-05

3.33 Game of dreidel

A dreidel is a four-sided spinning top with the Hebrew letters nun, gimel, hei, and shin, one on each side. Each side is equally likely to come up in a single spin of the dreidel. Suppose you spin a dreidel three times. Calculate the probability of getting

at least one nun?

#"at least one" is P(X > 0), so q=0
pbinom(q=0, size=3, prob=0.25, lower.tail=FALSE)

## [1] 0.5781

exactly 2 nuns?

dbinom(x=2, size=3, prob=0.25)

## [1] 0.1406

exactly 1 hei?

dbinom(x=1, size=3, prob=0.25)

## [1] 0.4219

at most 2 gimels?

pbinom(q=2, size=3, prob=0.25)

## [1] 0.9844

3.34 Arachnophobia

A 2005 Gallup Poll found that 7% of teenagers (ages 13 to 17) suffer from arachnophobia and are extremely afraid of spiders. At a summer camp there are 10 teenagers sleeping in each tent. Assume that these 10 teenagers are independent of each other.

Calculate the probability that at least one of them suffers from arachnophobia.

#"at least one" is P(X > 0), so q=0
pbinom(q=0, size=10, prob=0.07, lower.tail=FALSE)

## [1] 0.516

Calculate the probability that exactly 2 of them suffer from arachnophobia?

dbinom(x=2, size=10, prob=0.07)

## [1] 0.1234

Calculate the probability that at most 1 of them suffers from arachnophobia?

pbinom(q=1, size=10, prob=0.07)

## [1] 0.8483

If the camp counselor wants to make sure no more than 1 teenager in each tent is afraid of spiders, does it seem reasonable for him to randomly assign teenagers to tents?

p <- 0.07
n <- 10
mu <- n * p
mu

## [1] 0.7

sd <- sqrt(n*p*(1-p))
sd

## [1] 0.8068

No. With random assignment the expected number of arachnophobes per tent is 0.7, but the standard deviation is about 0.8, and from part (c) the probability of at most 1 per tent is only about 0.85. Roughly 15% of tents would end up with 2 or more. The counselor should ask the teens who is afraid and then assign manually.

3.35 Eye color part II

Exercise 3.25 introduces a husband and wife with brown eyes who have 0.75 probability of having children with brown eyes, 0.125 probability of having children with blue eyes, and 0.125 probability of having children with green eyes.

What is the probability that their first child will have green eyes and the second will not?

0.125*(1-0.125)

## [1] 0.1094

#or
dbinom(x=1, size=1, prob=0.125) * dbinom(x=1, size=1, prob=0.875)

## [1] 0.1094

What is the probability that exactly one of their two children will have green eyes?

dbinom(x=1, size=2, prob=0.125)

## [1] 0.2187

If they have six children, what is the probability that exactly two will have green eyes?

dbinom(x=2, size=6, prob=0.125)

## [1] 0.1374

If they have six children, what is the probability that at least one will have green eyes?

#"at least one" is P(X > 0), so q=0
pbinom(q=0, size=6, prob=0.125, lower.tail=FALSE)

## [1] 0.5512

What is the probability that the first green eyed child will be the 4th child?

0.875^3 * 0.125^1

## [1] 0.08374

Would it be considered unusual if only 2 out of their 6 children had brown eyes?

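#probability of 2 or fewer brown-eyed children out of 6
pr <- pbinom(q=2, size=6, prob=0.75)
#treat outcomes with probability below 0.05 as unusual
if (pr < 0.05) {
  print(paste0("Yes ", round(pr, 4)))
} else {
  print(paste0("No ", round(pr, 4)))
}

## [1] "Yes 0.0376"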

3.36 Sickle cell anemia

Sickle cell anemia is a genetic blood disorder where red blood cells lose their flexibility and assume an abnormal, rigid, “sickle” shape, which results in a risk of various complications. If both parents are carriers of the disease, then a child has a 25% chance of having the disease, 50% chance of being a carrier, and 25% chance of neither having the disease nor being a carrier. If two parents who are carriers of the disease have 3 children, what is the probability that

two will have the disease?

dbinom(x=2, size=3, prob=0.25)

## [1] 0.1406

none will have the disease?

dbinom(x=0, size=3, prob=0.25)

## [1] 0.4219

at least one will neither have the disease nor be a carrier?

#probability of "neither" is 0.25; "at least one" is P(X > 0), so q=0
pbinom(q=0, size=3, prob=0.25, lower.tail=FALSE)

## [1] 0.5781

the first child with the disease will be the third child?

0.75^2 * 0.25^1

## [1] 0.1406

3.37 Roulette winnings

In the game of roulette, a wheel is spun and you place bets on where it will stop. One popular bet is that it will stop on a red slot; such a bet has an 18/38 chance of winning. If it stops on red, you double the money you bet. If not, you lose the money you bet. Suppose you play 3 times, each time with a $1 bet. Let Y represent the total amount won or lost. Write a probability model for Y.

p <- 18/38
n <- 3
win.0 <- dbinom(x=0, size=n, prob=p)
win.1 <- dbinom(x=1, size=n, prob=p)
win.2 <- dbinom(x=2, size=n, prob=p)
win.3 <- dbinom(x=3, size=n, prob=p)
print(paste0(win.0, " probability of losing $3"))

## [1] "0.145793847499636 probability of losing $3"

print(paste0(win.1, " probability of losing $1"))

## [1] "0.393643388249016 probability of losing $1"

print(paste0(win.2, " probability of winning $1"))

## [1] "0.354279049424114 probability of winning $1"

print(paste0(win.3, " probability of winning $3"))

## [1] "0.106283714827234 probability of winning $3"

overall <- (win.0 * -3) + (win.1 * -1) + (win.2 * 1) + (win.3 * 3)
print(paste0("Overall expected win/loss of ", overall))

## [1] "Overall expected win/loss of -0.157894736842106"

#or
mu <- n * p
overall <- (3 * -1) + (mu * 2)
overall

## [1] -0.1579
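The same probability model can be laid out as a table, which makes it easy to confirm that the probabilities sum to 1. This is a sketch with column names of my own choosing; it assumes, as above, that each win is worth +$1 and each loss -$1.

```r
p <- 18/38
n <- 3
wins <- 0:n

# Each win returns +$1 and each loss -$1, so Y = wins - (n - wins)
model <- data.frame(
  wins = wins,
  Y    = wins - (n - wins),
  prob = dbinom(x = wins, size = n, prob = p)
)
model

sum(model$prob)              # a probability model must sum to 1
sum(model$Y * model$prob)    # expected winnings, same as 2*n*p - n
```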

3.38 Multiple choice quiz

In a multiple choice quiz there are 5 questions and 4 choices for each question (a, b, c, d). Robin has not studied for the quiz at all, and decides to randomly guess the answers. What is the probability that

the first question she gets right is the 3rd question?

0.75^2 * 0.25^1

## [1] 0.1406

she gets exactly 3 or exactly 4 questions right?

sum(dbinom(x=3:4, size=5, prob=0.25))

## [1] 0.1025

she gets the majority of the questions right?

#a majority of 5 is 3 or more, which is P(X > 2), so q=2
pbinom(q=2, size=5, prob=0.25, lower.tail=FALSE)

## [1] 0.1035

3.39 Exploring combinations

The formula for the number of ways to arrange n objects is

\[n! = n \times (n-1) \times \cdots \times 2 \times 1\]

This exercise walks you through the derivation of this formula for a couple of special cases.

A small company has five employees: Anna, Ben, Carl, Damian, and Eddy. There are five parking spots in a row at the company, none of which are assigned, and each day the employees pull into a random parking spot. That is, all possible orderings of the cars in the row of spots are equally likely.

On a given day, what is the probability that the employees park in alphabetical order?

1/factorial(5)

## [1] 0.008333

If the alphabetical order has an equal chance of occurring relative to all other possible orderings, how many ways must there be to arrange the five cars?

factorial(5)

## [1] 120

Now consider a sample of 8 employees instead. How many possible ways are there to order these 8 employees’ cars?

factorial(8)

## [1] 40320

3.40 Male children

While it is often assumed that the probabilities of having a boy or a girl are the same, the actual probability of having a boy is slightly higher at 0.51. Suppose a couple plans to have 3 kids.

Use the binomial model to calculate the probability that two of them will be boys.

dbinom(x=2, size=3, prob=0.51)

## [1] 0.3823

Write out all possible orderings of 3 children, 2 of whom are boys. Use these scenarios to calculate the same probability from part (a) but using the Addition Rule for disjoint events.

B-B-G, B-G-B, G-B-B

(0.51*0.51*0.49) + (0.51*0.49*0.51) + (0.49*0.51*0.51)

## [1] 0.3823

Confirm that your answers from parts (a) and (b) match.

If we wanted to calculate the probability that a couple who plans to have 8 kids will have 3 boys, briefly describe why the approach from part (b) would be more tedious than the approach from part (a).

Because with the second method you have to write out and add up ${8 \choose 3} = 56$ scenarios.
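A short sketch of what the second method amounts to for 8 children and 3 boys, letting combn() do the enumeration. The variable names are mine; the only inputs taken from the exercise are n = 8, k = 3 and p = 0.51.

```r
n <- 8; k <- 3; p <- 0.51

# Each column of combn() is one choice of which k births are boys
positions <- combn(n, k)
ncol(positions)              # number of orderings, choose(8, 3)

# Every ordering has the same probability, so the Addition Rule
# reduces to (number of orderings) * (probability of one ordering)
by_enumeration <- ncol(positions) * p^k * (1 - p)^(n - k)

all.equal(by_enumeration, dbinom(x = k, size = n, prob = p))
```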

3.6.5 More discrete distributions

3.41 Identify the distribution

Calculate the following probabilities and indicate which probability distribution model is appropriate in each case. You roll a fair die 5 times. What is the probability of rolling

The first 6 on the fifth roll?

(5/6)^4 * (1/6)^1

## [1] 0.08038

#or
dbinom(x=1, size=5, prob=1/6) / choose(5, 1)

## [1] 0.08038

#or
dnbinom(x=4, size=1, prob=1/6)

## [1] 0.08038

#the above says, return the probability of four failures before the first
#success, so the success must come on the last (fifth) trial
#or
dgeom(x=4, prob=1/6)

## [1] 0.08038

#the above says, probability of four failures followed by a success

exactly three 6s?

dbinom(x=3, size=5, prob=1/6)

## [1] 0.03215

the third 6 on the fifth roll?

#first get the probability of 2 6s on the first four rolls
#then the probability of a 6 on one roll
#and multiply the two together
dbinom(x=2, size=4, prob=1/6) * dbinom(x=1, size=1, prob=1/6)

## [1] 0.01929

#or
dnbinom(x=2, size=3, prob=1/6)

## [1] 0.01929

#the above says, return the probability of two failures before the third
#success, so the fifth roll is always the third 6
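To make the link between dnbinom and the textbook formula explicit, here is a sketch that computes the same probability both ways. The names k and n are my own labels for "which success" and "which trial"; R's dnbinom takes the number of failures, n - k, as its first argument.

```r
p <- 1/6
k <- 3        # which success we are waiting for
n <- 5        # trial on which that success occurs

# R's dnbinom counts failures before the k-th success: x = n - k
from_r <- dnbinom(x = n - k, size = k, prob = p)

# Textbook form: choose(n-1, k-1) * p^k * (1-p)^(n-k)
by_hand <- choose(n - 1, k - 1) * p^k * (1 - p)^(n - k)

all.equal(from_r, by_hand)
```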

3.42 Darts

Calculate the following probabilities and indicate which probability distribution model is appropriate in each case. A very good darts player can hit the bullseye (red circle in the center of the dart board) 65% of the time. What is the probability that he

hits the bullseye for the 10th time on the 15th try?

dbinom(x=9, size=14, prob=0.65) * dbinom(x=1, size=1, prob=0.65)

## [1] 0.1416

#or
dnbinom(x=5, size=10, prob=0.65)

## [1] 0.1416

hits the bullseye 10 times in 15 tries?

dbinom(x=10, size=15, prob=0.65)

## [1] 0.2123

#or
choose(15, 10) * 0.65^10 * 0.35^5

## [1] 0.2123

hits the first bullseye on the third try?

dgeom(x=2, prob=0.65)

## [1] 0.07962

#or
0.35^2 * 0.65^1

## [1] 0.07962

3.43 Sampling at school

For a sociology class project you are asked to conduct a survey on 20 students at your school. You decide to stand outside of your dorm’s cafeteria and conduct the survey on a random sample of 20 students leaving the cafeteria after dinner one evening. Your dorm is comprised of 45% males and 55% females.

Which probability model is most appropriate for calculating the probability that the 4th person you survey is the 2nd female? Explain.

The negative binomial, which gives the probability that the kth success arrives on the nth trial. The last trial is always a success, and the earlier successes and failures can come in any order.

Compute the probability from part a.

dnbinom(x=2, size=2, prob=0.55)

## [1] 0.1838

#or
0.55^2 * 0.45^2 * choose(3, 1)

## [1] 0.1838

#the fourth trial is always female, so we use choose on the first three trials
#to arrive at the number of combinations

The three possible scenarios that lead to 4th person you survey being the 2nd female are

\[\{M,M,F,F\},\{M,F,M,F\},\{F,M,M,F\}\]

One common feature among these scenarios is that the last trial is always female. In the first three trials there are 2 males and 1 female. Use the binomial coefficient to confirm that there are 3 ways of ordering 2 males and 1 female.

n <- 4
k <- 2
choose(n-1, k-1)

## [1] 3

#or
factorial(n-1)/(factorial(k-1) * factorial(n-k))

## [1] 3

Use the findings presented in part (c) to explain why the formula for the coefficient for the negative binomial is ${n-1 \choose k-1}$ while the formula for the binomial coefficient is ${n \choose k}$.

The last trial in the negative binomial is always a success, so it does not vary and we do not need to consider it when counting the number of orderings. Only the first n-1 trials, which contain k-1 successes, can be rearranged.

3.44 Serving in volleyball

A not-so-skilled volleyball player has a 15% chance of making the serve, which involves hitting the ball so it passes over the net on a trajectory such that it will land in the opposing team’s court. Suppose that her serves are independent of each other.

What is the probability that on the 10th try she will make her 3rd successful serve?
Suppose she has made two successful serves in nine attempts. What is the probability that her 10th serve will be successful?
Even though parts (a) and (b) discuss the same scenario, the probabilities you calculated should be different. Can you explain the reason for this discrepancy?
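These parts are not worked above. One possible approach, offered only as a sketch: part (a) is a negative binomial probability (7 failures before the 3rd success), while part (b) relies on the stated independence of serves. The names part_a and part_b are mine.

```r
p <- 0.15

# (a) 3rd success on the 10th try: 7 failures before the 3rd success
part_a <- dnbinom(x = 7, size = 3, prob = p)

# (b) serves are independent, so the first nine attempts are irrelevant
part_b <- p

c(part_a = part_a, part_b = part_b)
```

For part (c), the difference is that (a) is the probability of the whole ten-serve sequence, while (b) is conditional on the first nine serves having already happened.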

3.45 Customers at a coffee shop, part I

A coffee shop serves an average of 75 customers per hour during the morning rush.

Which distribution we have studied is most appropriate for calculating the probability of a given number of customers arriving within one hour during this time of day?
What are the mean and the standard deviation of the number of customers this coffee shop serves in one hour during this time of day?
Would it be considered unusually low if only 60 customers showed up to this coffee shop in one hour during this time of day?

3.46 Stenographer’s typos, part I

A very skilled court stenographer makes one typographical error (typo) per hour on average.

What probability distribution is most appropriate for calculating the probability of a given number of typos this stenographer makes in an hour?
What are the mean and the standard deviation of the number of typos this stenographer makes?
Would it be considered unusual if this stenographer made 4 typos in a given hour?

3.47 Customers at a coffee shop part II

Exercise 3.45 gives the average number of customers visiting a particular coffee shop during the morning rush hour as 75. Calculate the probability that this coffee shop serves 70 customers in one hour during this time of day?

3.48 Stenographer’s typos part II

Exercise 3.46 gives the average number of typos of a very skilled court stenographer as 1 per hour. Calculate the probability that this stenographer makes at most 2 typos in a given hour.
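Exercises 3.45 to 3.48 are not worked above. The sketch below assumes the Poisson distribution is the intended model for counts per hour, and uses R's dpois and ppois, which follow the same d/p naming as the other distributions. The variable names are my own.

```r
# 3.45 / 3.47: customers per hour, rate lambda = 75
lambda_shop <- 75
c(mean = lambda_shop, sd = sqrt(lambda_shop))   # Poisson: variance equals the mean
(60 - lambda_shop) / sqrt(lambda_shop)          # z-score of 60 customers
dpois(x = 70, lambda = lambda_shop)             # P(exactly 70 customers)

# 3.46 / 3.48: typos per hour, rate lambda = 1
lambda_typo <- 1
(4 - lambda_typo) / sqrt(lambda_typo)           # z-score of 4 typos
ppois(q = 2, lambda = lambda_typo)              # P(at most 2 typos)
```

The z-scores use the same "more than 2 standard deviations from the mean" rule of thumb for judging whether a count is unusual.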