(This file is still a work in progress.)
All material here is based on the free, excellent statistics textbook at openintro.org. These are largely my notes, to serve as an aide-memoire. I make no guarantees that there are no mistakes or typos, though I certainly have a motivation to avoid tripping myself up in the future with careless mistakes.
The exercises are from chapter 3 of the same text. If you intend to use R for statistical analysis, becoming familiar, nay agile, with the functions for all of the common probability distributions is a prerequisite. The questions in these exercises should lead immediately to the proper function and the parameters for that function. When you are sure-footed with that, you’re ready to move on.
Each distribution has functions for density, distribution, quantile and random generation.
The density function (commonly known as a Probability Density Function or PDF) indicates how high the distribution curve is at an exact point. The distribution function (the cumulative distribution function, or CDF) gives the probability of a value at or below a given point. The quantile function is the inverse of the CDF: given a probability, it returns the value below which that proportion of the distribution lies. For the standard normal distribution, that value is in units of standard deviations.
In R, the PDF name starts with “d”: dnorm, dbinom, dunif, etc.
Here is a simple example: the plot of the PDF for the standard normal distribution, so the mean is 0 and the standard deviation is 1. The x axis runs from -3 to +3 standard deviations.
set.seed(1)
data <- dnorm(seq(-3, 3, 0.05))
x <- seq(-3, 3, 0.05)
plot(x, data, type="l", xlab="S.D.s from Mean", ylab="Prob.")
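This shows that the height of the curve at 0 is about 0.39. For a continuous distribution this is a density, not a probability: the probability of any single exact value is zero, and probabilities come from areas under the curve. We can get specific densities using the PDF, which is dnorm.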
dnorm(x=0)
## [1] 0.3989
A PDF should always have a total area of 1 under the curve. This is true using dnorm for any standard deviation, so when the standard deviation grows, the curve gets wider and correspondingly lower. For example, here the standard deviation is 10.
data <- dnorm(seq(-30, 30, 0.05), mean=0, sd=10)
x <- seq(-30, 30, 0.05)
plot(x, data, type="l", xlab="S.D.s from mean X S.D.", ylab="Prob./S.D.")
Now the x scale must go from -30 to 30 to be equivalent to the +/- 3 standard deviations shown for the standard normal distribution.
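As a rough check of the claim that the area stays at 1 whatever the standard deviation, the area can be computed numerically. This is only a sketch using base R's integrate(); the variable names are my own.

```r
# Numerically integrate the normal density over the whole real line
area.sd1 <- integrate(dnorm, lower = -Inf, upper = Inf)$value
area.sd10 <- integrate(function(x) dnorm(x, mean = 0, sd = 10),
                       lower = -Inf, upper = Inf)$value
area.sd1
area.sd10

# The peak is ten times lower when the standard deviation is ten times larger
peak.ratio <- dnorm(0) / dnorm(0, mean = 0, sd = 10)
peak.ratio
```

Both areas should come out at 1 (up to numerical error), while the peak of the wider curve is one tenth as high.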
The cumulative distribution function is also commonly known just as the distribution function. It returns the area under the curve of the PDF to the left of a given point.
In R, the CDF name begins with “p”.
For example:
data <- dnorm(seq(-3, 3, 0.05))
x <- seq(-3, 3, 0.05)
plot(x, data, type="l", xlab="S.D.s from Mean", ylab="Prob.")
abline(v=1, col="red")
dnorm(x=1)
## [1] 0.242
pnorm(q=1)
## [1] 0.8413
The red line intersects the curve at one standard deviation above the mean of zero. pnorm(q=1) returns the area under the curve for the standard normal distribution to the left of this point. So about 84% of the area under the curve lies to the left of the red line. And dnorm(x=1) shows that the curve is 0.242 high at that point; this is a density, not the probability of a value exactly one standard deviation above the mean.
If we want to know the area to the right of the red line:
pnorm(q=1, lower.tail=FALSE)
## [1] 0.1587
Here is a plot of the values of the CDF for the standard normal distribution.
data <- seq(-3, 3, 0.01)
plot(data, pnorm(data), type="l")
abline(v=0, col="red")
The red line is at the mean, and so this indicates, as expected, that half of the area under the PDF curve is below the mean, and half above. It also demonstrates that, below -2 standard deviations and above 2, there is very little area under the PDF curve. This corresponds to the common use of the 95% confidence interval, which covers the area of the PDF from -1.96 to +1.96 standard deviations.
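The quantile function is the inverse of the CDF (not of the PDF): given a probability, it returns the point below which that much of the area lies. In R, the quantile function name begins with “q”: qnorm, qbinom, qunif, etc.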
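A small sketch of how the quantile function undoes the CDF for the standard normal distribution. The variable names are mine and are only for illustration.

```r
# pnorm() maps a point to the area on its left; qnorm() maps the area back
p <- pnorm(q = 1)      # area to the left of 1
back <- qnorm(p = p)   # recovers the original point
back

# The points that cut off 2.5% in each tail of the standard normal
tails <- qnorm(p = c(0.025, 0.975))
tails
```

The second call is where the familiar -1.96 and +1.96 come from.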
The random generation function is something provided by R to generate random data for a given distribution. The function name begins with “r”: rnorm, rbinom, runif etc.
Here is an example of randomly-generated data for a standard normal distribution:
data <- rnorm(n=100, mean=0, sd=1)
hist(data)
The histogram looks roughly normal; the larger the data set generated, the more normal it will look:
hist(rnorm(n=1000))
hist(rnorm(n=10000))
hist(rnorm(n=1000000))
Quantiles can be expressed in various terms. A common form is percentiles; so a quantile of 0.99 means a value is at the 99th percentile. The “q” functions take such a probability and return a value on the scale of the variable; for the standard normal distribution, that value is a number of standard deviations from the mean. So a QQ (quantile-quantile) plot might show an x scale of -2.5 to 2.5. This means that the theoretical quantile for a given value is somewhere between 2.5 sd’s below and 2.5 sd’s above the mean.
The density function of the normal distribution accepts a vector of values for x, and optionally a mean (defaults to 0), standard deviation (defaults to 1), and log (defaults to FALSE; if TRUE, the log of the density is returned). A very simple example:
dnorm(0.9)
## [1] 0.2661
This shows that a value of 0.9, in a normal distribution with mean=0 and sd=1, corresponds to a curve height of about 0.27. This is a density rather than a probability; the probability of a value falling in a small interval around 0.9 is approximately this height times the width of the interval.
Now here are several values from dnorm, for a sequence of values from -3 to 3. The familiar bell curve becomes visible.
x <- seq(-3, 3, 0.1)
plot(x, dnorm(x), type="l", xlab="S.D.s from Mean", ylab="Prob.")
This shows that values more than 2 standard deviations from the mean, in either direction, are unlikely (the curve is low there), and that the density peaks at about 0.4 at the mean.
The examples above are for a standard normal population, but different means and standard deviations can be supplied. For example, suppose that the mean score for the SAT is 1500 and the sd is 300.
mean = 1500
sd = 300
data <- seq(mean - (4*sd), mean + (4*sd), 20)
plot(data, dnorm(data, mean=mean, sd=sd), type="l")
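Note the change in the scale of the y axis. The height of the curve is a density: probability per unit of x. The same total area of 1 is now spread over an x range that is 300 times wider, so the curve is 300 times lower: it is the standard normal density at the matching Z score, divided by the standard deviation.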
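To make the change of scale concrete, here is a sketch that compares the density on the SAT scale with the standard normal density at the matching Z scores. It uses the mean of 1500 and sd of 300 from the text; the three x values are arbitrary picks of mine.

```r
mu <- 1500
sigma <- 300
x <- c(900, 1500, 1800)

# Density on the original scale
direct <- dnorm(x, mean = mu, sd = sigma)

# Standard normal density at the Z scores, divided by the standard deviation
rescaled <- dnorm((x - mu) / sigma) / sigma

all.equal(direct, rescaled)
```

The two agree, which is why the y axis shrinks by a factor of the standard deviation while the area under the curve stays at 1.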
When to use: You need to know the expected number of trials until the first success.
Here is the probability density for the geometric distribution over the interval 0 to 10 when probability is 0.5.
barplot(dgeom(x=seq(0,10), prob=0.5))
Here is the probability density for the geometric distribution over the interval 0 to 20 when probability is 0.1.
barplot(dgeom(x=seq(0,20), prob=0.1))
The cumulative distribution function gives a cumulative probability for all values in a range from 0 to a quantile q. For example, here is the probability of the first heads coming in 1, 2, 3, 4, 5, or 6 flips.
pgeom(q=5, prob=0.5)
## [1] 0.9844
The cumulative distribution function rises steeply and then levels off toward 1, since it equals 1 - (1 - prob)^(q+1).
barplot(pgeom(q=seq(0,10), prob=0.5))
The quantile function is the inverse of the cumulative distribution function.
qgeom(p=0.984375, prob=0.5)
## [1] 5
The rgeom function is not entirely intuitive. It takes two parameters, n (the number of values to be created) and prob, the probability of a success. Each value returned is the number of failures before the first success in one simulated sequence of trials.
For example, if prob is 1/6, this is equivalent to rolling a 6 (or any other specified value) on a single die. If a value returned is 3, it means that the first three die rolls were not a 6; so it took 4 tries to get a six.
If prob is close to 0, then the values returned will tend to be higher (and probably no zeroes). If prob is close to 1, many of the values returned will be 0.
set.seed(1)
rgeom(10, 0.1)
## [1] 6 3 23 3 24 13 15 2 20 3
rgeom(10, 0.8)
## [1] 0 0 0 0 1 0 0 0 0 0
If the probability of success is 0.35, then to calculate the probability that the first success will occur within the first 4 trials:
pgeom(q=3, prob=0.35)
## [1] 0.8215
Note that q is one less than the number of trials being tested. This says, return the probability that there will be q or fewer failures before the first success.
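The off-by-one between trials and failures is easy to trip over, so here is a sketch that gets the same number by hand. The names trials, all.fail, by.hand and from.pgeom are mine.

```r
p <- 0.35
trials <- 4

# Probability that all of the first four trials fail
all.fail <- (1 - p)^trials

# A success within four trials is the complement of that
by.hand <- 1 - all.fail
by.hand

# pgeom counts failures before the first success, hence trials - 1
from.pgeom <- pgeom(q = trials - 1, prob = p)
all.equal(by.hand, from.pgeom)
```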
When to use: Given a fixed number of trials, you need to know the probability of a certain number of successes, or the number of anticipated successes for a certain probability.
The probability of one particular sequence of k successes and n-k failures in n trials is:
\[p^k(1-p)^{n-k}\]
There can be multiple ways of those k successes occurring. For example, if k is 1, then it can occur in any one of the n trials. So the total probability of k successes in n trials is:
\[{n \choose k} p^k(1-p)^{n-k}\]
Where:
\[{n \choose k} = \frac{n!}{k!(n-k)!}\]
The mean of the number of observed successes is: \[\mu = np\]
The variance of the number of observed successes is: \[np(1-p)\]
And so the standard deviation of the number of observed successes is: \[\sqrt{np(1-p)}\]
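As a sanity check on the formulas above, this sketch computes the binomial probabilities by hand and compares them with dbinom, then recovers the mean and standard deviation from the probabilities themselves. The values n = 4 and p = 0.35 are just an illustrative choice.

```r
n <- 4
p <- 0.35
k <- 0:n

# Binomial probabilities from the formula
by.formula <- choose(n, k) * p^k * (1 - p)^(n - k)
all.equal(by.formula, dbinom(x = k, size = n, prob = p))

# Mean and standard deviation computed from the probabilities themselves
mu <- sum(k * by.formula)
sigma <- sqrt(sum((k - mu)^2 * by.formula))
c(mu, n * p)
c(sigma, sqrt(n * p * (1 - p)))
```

Each pair printed at the end should match.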
pbinom shows the probability of having q or fewer successes in size trials. For example:
pbinom(q=1, size=4, prob=0.35)
## [1] 0.563
This says that, in four trials when the probability of success is 0.35, there is about a 0.563 probability of having zero or one successes. If you want to know the probability of having exactly one success:
pbinom(q=1, size=4, prob=0.35) - pbinom(q=0, size=4, prob=0.35)
## [1] 0.3845
This is the probability of at least one success in four trials:
pbinom(q=4, size=4, prob=0.35) - pbinom(q=0, size=4, prob=0.35)
## [1] 0.8215
Note that this can be done with pgeom, as shown above. The following, again, shows the probability that there will be at most q failures before the first success:
pgeom(q=3, prob=0.35)
## [1] 0.8215
This is also equivalent to:
0.65^0*0.35 + 0.65^1*0.35 + 0.65^2 * 0.35 + 0.65^3 * 0.35
## [1] 0.8215
The latter is the probability of a success on the first trial, plus the same on the second, third and fourth trials.
Need to fill in more but here are the basics.
Suppose that the probability of someone smoking is 0.2. A sample of 400 people is drawn. What is the probability that no more than 59 are smokers?
To do this with a binomial distribution, we must calculate the probability of 0, 1, 2…59 smokers individually, and sum them. Or we can use R.
pbinom(59, 400, 0.2)
## [1] 0.004112
Summing up all of the individual probabilities is tedious. Or might I say, was tedious, back in the dark ages before computers could do things like this instantly. But the canon of statistical teaching still includes using the normal distribution to approximate the binomial distribution. We can use a mean of:
\[\mu = np\]
And a standard deviation of:
\[\sqrt{np(1-p)}\]
And so:
n = 400
p = 0.2
mu = n * p
sd = sqrt(n * p * (1-p))
pnorm(59, mu, sd)
## [1] 0.004332
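The values are similar but not identical: the binomial distribution is discrete and, when p is not 0.5, slightly skewed, while the normal curve is a continuous, symmetric approximation to it. Need to illustrate this.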
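One way to see how close the approximation is: evaluate the binomial probabilities and the normal curve at the same counts and look at the largest gap. This is only a sketch; the range 50 to 110 is an arbitrary window of mine around the mean of 80.

```r
n <- 400
p <- 0.2
mu <- n * p
sigma <- sqrt(n * p * (1 - p))

k <- 50:110
exact <- dbinom(k, size = n, prob = p)      # discrete binomial probabilities
approx <- dnorm(k, mean = mu, sd = sigma)   # normal curve at the same points

# Largest disagreement between the two over this range
max.gap <- max(abs(exact - approx))
max.gap

# The two tail probabilities from the text, side by side
tail.probs <- c(binomial = pbinom(59, n, p), normal = pnorm(59, mu, sigma))
tail.probs
```

Plotting exact as bars with approx drawn over them as a line would show the same thing visually.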
When to use: Given a fixed number of successes, you need to know the anticipated number of trials, given a certain probability, or the probability of a certain number of trials.
Very quickly. The probability that the fourth success arrives on the sixth trial, when probability of success is 0.8:
n=6
k=4
dnbinom(x=n-k, size=k, prob=0.8)
## [1] 0.1638
Note that binomial distribution is based on a fixed number of trials; negative binomial is based on a fixed number of successes.
The following shows the probability that the fourth success arrives on or before the sixth trial, that is, on trial 4, 5, or 6:
pnbinom(q=n-k, size=k, prob=0.8)
## [1] 0.9011
sum(dnbinom(x=0:2, size=k, prob=0.8))
## [1] 0.9011
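For reference, the dnbinom value above can be reproduced from the negative binomial formula: the final trial must be a success, and the remaining k-1 successes can fall anywhere among the first n-1 trials. A sketch, with by.formula as my own name:

```r
n <- 6      # trial on which the final success arrives
k <- 4      # number of successes required
p <- 0.8

# choose(n-1, k-1) arrangements of the earlier successes
by.formula <- choose(n - 1, k - 1) * p^k * (1 - p)^(n - k)
by.formula
all.equal(by.formula, dnbinom(x = n - k, size = k, prob = p))
```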
\[P(\text{observe }k\text{ rare events})= \frac{\lambda^ke^{-\lambda}}{k!}\]
Consider that a rare event occurs an average of 4.4 times per given period. What is the probability of it occurring 5 times?
lambda=4.4
k=5
(lambda^k * exp(-lambda))/factorial(k)
## [1] 0.1687
dpois(5, 4.4)
## [1] 0.1687
Probability of occurring 5 or fewer times:
sum(dpois(0:5, 4.4))
## [1] 0.7199
ppois(5, 4.4)
## [1] 0.7199
And more than 5:
1-sum(dpois(0:5, 4.4))
## [1] 0.2801
ppois(5, 4.4, lower.tail=FALSE)
## [1] 0.2801
See plnorm
This distribution is not covered in chapter 3 of the DASI text; it is introduced in chapter 6, in the context of comparing 3 or more values for a categorical variable.
I ran a simple experiment. Imagine a sample that has 40 positives and 60 negatives. The population has a proportion of 0.5 for each. Chi-square test shouldn’t be done (I think) on only two values/bins, but let’s try it.
tmp <- chisq.test(x=c(40, 60), p=c(0.5, 0.5))
tmp$p.value
## [1] 0.0455
The p-value is a bit below 0.05 so we conclude (in this perhaps invalid test) that the sample is biased and does not adequately represent the population.
Now do a hypothesis test on the proportion, with the null hypothesis that the true proportion is 0.5.
n <- 100
point.est <- 0.4
null.prop <- 0.5
se <- sqrt((null.prop*(1-null.prop))/n)
z.score <- (point.est - null.prop)/se
pnorm(z.score) * 2 # double-sided test
## [1] 0.0455
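The p-values are identical. This is no accident: with two bins the chi-square statistic has one degree of freedom and equals the square of the Z score (here \(X^2 = 4 = (-2)^2\)), so the two tests are equivalent. So it is OK to use chi-square for fewer than three bins.
Also note that chisq.test does not presume a 95% confidence level; it simply reports a p-value, which you then compare against whatever significance level you choose. You could also take the \(X^2\) value and degrees of freedom and compute the p-value yourself with pchisq.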
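Here is a sketch of computing the chi-square p-value by hand with pchisq and lining it up against the Z test for the same 40/60 sample. The variable names are mine.

```r
observed <- c(40, 60)
expected <- sum(observed) * c(0.5, 0.5)

# Chi-square statistic and its p-value with 1 degree of freedom
x2 <- sum((observed - expected)^2 / expected)
p.chisq <- pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)

# Z statistic for the same data and its two-sided p-value
z <- (0.4 - 0.5) / sqrt(0.5 * 0.5 / 100)
p.z <- 2 * pnorm(-abs(z))

c(x2 = x2, z.squared = z^2)
c(p.chisq = p.chisq, p.z = p.z)
```

The statistic is the square of the Z score and the two p-values coincide.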
What percent of a standard normal distribution \(N(\mu=0, \sigma=1)\) is found in each region?
pnorm(q=-1.35, mean=0, sd=1)
## [1] 0.08851
pnorm(q=1.48, mean=0, sd=1, lower.tail=FALSE)
## [1] 0.06944
pnorm(q=1.5, mean=0, sd=1) - pnorm(q=-0.4, mean=0, sd=1)
## [1] 0.5886
pnorm(q=-2, mean=0, sd=1) +
pnorm(q=2, mean=0, sd=1, lower.tail=FALSE)
## [1] 0.0455
What percent of a standard normal distribution \(N(\mu=0, \sigma=1)\) is found in each region?
pnorm(q=-1.13, mean=0, sd=1, lower.tail=FALSE)
## [1] 0.8708
pnorm(q=0.18, mean=0, sd=1)
## [1] 0.5714
pnorm(q=8, mean=0, sd=1, lower.tail=FALSE)
## [1] 6.221e-16
1 - (pnorm(-0.5, mean=0, sd=1) +
pnorm(0.5, mean=0, sd=1, lower.tail=FALSE))
## [1] 0.3829
#or
pnorm(0.5, mean=0, sd=1) - pnorm(-0.5, mean=0, sd=1)
## [1] 0.3829
A college senior who took the GRE scored 620 on the verbal reasoning section and 670 on the quantitative reasoning section. The mean score for the verbal reasoning section was 462 with a standard deviation of 119, and the mean score for the quantitative reasoning section was 584 with a standard deviation of 151. Suppose that both distributions are nearly normal.
Verbal: \(N(\mu=462, \sigma=119)\)
Quantitative: \(N(\mu=584, \sigma=151)\)
(620-462)/119
## [1] 1.328
On the quantitative reasoning section?
(670-584)/151
## [1] 0.5695
Draw a standard normal distribution curve and mark these two Z scores.
x <- seq(-3, 3, length=100)
y <- dnorm(x)
plot(x, y, type="l", lty=1)
abline(v=(620-462)/119, col="blue")
abline(v=(670-584)/151, col="red")
She scored 1.328 standard deviations above the mean in verbal and 0.5695 standard deviations above the mean in quantitative.
Verbal
#verbal:
pnorm(q=620, mean=462, sd=119)
## [1] 0.9079
#or
pnorm(q=(620-462)/119, mean=0, sd=1)
## [1] 0.9079
#quantitative:
pnorm(q=670, mean=584, sd=151)
## [1] 0.7155
#or
pnorm(q=(670-584)/151, mean=0, sd=1)
## [1] 0.7155
100 - (pnorm(q=620, mean=462, sd=119)*100)
## [1] 9.213
Quantitative?
100 - (pnorm(q=670, mean=584, sd=151)*100)
## [1] 28.45
It depends on what you mean by an “incorrect conclusion”. You might very well, correctly, conclude that she did in fact perform better on the quantitative reasoning section, if your criterion is percentage of all points obtained. Let me put it this way. Do you want a brain surgeon who performed better than 99% of all students on a literary criticism exam but only 1% of all students on a brain surgery exam? I could make this even more extreme. Suppose that the mean score on the literary criticism exam is 3 (out of 100), and the mean score on the brain surgery exam is 90 (out of 100). We see that as a rule, brain surgeons know nothing about literary criticism, but they know a lot about brain surgery. Our “winner” can talk all day about Derrida but doesn’t know what a medulla oblongata is.
However, if our criterion is to compare a student’s score to the performance of the population of students (and we assume that this population is normal), then we should look for the higher Z score, which tells us how many standard deviations above the mean a student scored.
Considering that Z scores no longer have meaning in comparison to each other, then yes, my answers would generally be “dunno”.
In triathlons, it is common for racers to be placed into age and gender groups. Friends Leo and Mary both completed the Hermosa Beach triathlon, where Leo competed in the Men Ages 30-34 group while Mary competed in the Women Ages 25-29 group. Leo completed the race in 1:22:28 (4948 seconds) while Mary completed the race in 1:31:53 (5513 seconds). Obviously, Leo finished faster, but they are curious how they did within their respective groups. Here is the information on the performance of their groups:
Men Ages 30-34 mean=4313, sd=583
Women Ages 25-29 mean=5261, sd=807
Distributions for both are approximately normal
Men Ages 30-34: \(N(\mu=4313, \sigma=583)\)
Women Ages 25-29: \(N(\mu=5261, \sigma=807)\)
leo.z = (4948-4313)/583
mary.z = (5513-5261)/807
leo.z
## [1] 1.089
mary.z
## [1] 0.3123
# Lower times are better, so the lower Z score ranks better within its group
if (leo.z < mary.z) {
  print("Leo")
} else {
  print("Mary")
}
## [1] "Mary"
100 - (pnorm(q=leo.z)*100)
## [1] 13.8
100 - (pnorm(q=mary.z)*100)
## [1] 37.74
DUH
In exercise 3.3 we saw two distributions for GRE scores, \(N(\mu=462, \sigma=119)\) for verbal and \(N(\mu=584, \sigma=151)\) for quantitative.
qnorm(p=0.8, mean=462, sd=119)
## [1] 562.2
qnorm(0.7, mean=584, sd=151)
## [1] 663.2
In exercise 3.4, we saw two distributions for triathlon times: \(N(\mu=4313, \sigma=583)\) for Men ages 30-34 and \(N(\mu=5261, \sigma=807)\) for Women ages 25-29.
qnorm(0.05, mean=4313, sd=583)
## [1] 3354
qnorm(0.9, mean=5261, sd=807)
## [1] 6295
The average daily high temperature in June in LA is 77F with a standard deviation of 5F. Suppose the temperatures closely follow a normal distribution.
pnorm(83, mean=77, sd=5, lower.tail=FALSE)
## [1] 0.1151
qnorm(0.1, mean=77, sd=5)
## [1] 70.59
The Capital Asset Pricing Model is a financial model that assumes returns on a portfolio are normally distributed. Suppose a portfolio has an average annual return of 14.7% (i.e. an average gain of 14.7%) with a standard deviation of 33%. A return of 0% means the value of the portfolio doesn’t change, a negative return means that the portfolio loses money, and a positive return means that the portfolio gains money.
pnorm(q=0, mean=0.147, sd=0.33)
## [1] 0.328
qnorm(0.85, mean=0.147, sd=0.33)
## [1] 0.489
Exercise 3.7 states that the average daily high temperature in LA in June is 77F with a standard deviation of 5F, and that it can be assumed to follow a normal distribution. We use the following to convert F to C:
\[C = (F-32) \cdot \frac{5}{9}\]
\[N(\mu=25, \sigma=2.78)\]
pnorm(q=28, mean=25, sd=2.78, lower.tail=FALSE)
## [1] 0.1403
No. 28C “roughly” corresponds to 83F but is not an exact match. 83F is closer to 28.333C. If I use the latter, I get the same result.
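The conversion of the distribution, and the check with the exact Celsius equivalent of 83F, can be sketched as follows. The helper f.to.c and the variable names are mine.

```r
f.to.c <- function(f) (f - 32) * 5 / 9

mu.c <- f.to.c(77)     # the mean shifts and scales
sd.c <- 5 * 5 / 9      # the standard deviation only scales
c(mu.c, sd.c)

# 83F converted exactly, rather than rounded to 28C
p.f <- pnorm(83, mean = 77, sd = 5, lower.tail = FALSE)
p.c <- pnorm(f.to.c(83), mean = mu.c, sd = sd.c, lower.tail = FALSE)
all.equal(p.f, p.c)
```

Subtracting 32 moves the mean but not the spread, which is why the standard deviation is only multiplied by 5/9.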
Heights of 10 year olds, regardless of gender, closely follow a normal distribution with mean 55 inches and standard deviation 6 inches.
pnorm(q=48, mean=55, sd=6)
## [1] 0.1217
pnorm(q=65, mean=55, sd=6) - pnorm(60, mean=55, sd=6)
## [1] 0.1545
qnorm(p=0.9, mean=55, sd=6)
## [1] 62.69
pnorm(q=54, mean=55, sd=6)
## [1] 0.4338
The distribution of passenger vehicle speeds traveling on the Interstate 5 Freeway (I-5) in California is nearly normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 miles/hour.
pnorm(q=80, mean=72.6, sd=4.78)
## [1] 0.9392
pnorm(q=80, mean=72.6, sd=4.78) - pnorm(q=60, mean=72.6, sd=4.78)
## [1] 0.935
qnorm(p=0.95, mean=72.6, sd=4.78)
## [1] 80.46
pnorm(q=70, mean=72.6, sd=4.78, lower.tail=FALSE)
## [1] 0.7068
Suppose weights of the checked baggage of airline passengers follow a nearly normal distribution with mean 45 pounds and standard deviation 3.2 pounds. Most airlines charge a fee for baggage that weighs in excess of 50 pounds. Determine what percent of airline passengers incur this fee.
pnorm(q=50, mean=45, sd=3.2, lower.tail=FALSE)
## [1] 0.05909
mensa.z <- qnorm(0.98, mean=0, sd=1)
mensa.sd <- (132-100)/mensa.z
mensa.sd
## [1] 15.58
chol.z <- qnorm(p=(1-0.185), mean=0, sd=1)
chol.sd <- (220 - 185) / chol.z
chol.sd
## [1] 39.04
The textbook you need to buy for your chemistry class is expensive at the college bookstore, so you consider buying it on Ebay instead. A look at past auctions suggests that the prices of that chemistry textbook have an approximately normal distribution with mean $89 and standard deviation $15.
pnorm(q=100, mean=89, sd=15, lower.tail=FALSE)
## [1] 0.2317
Too high: You ensure you get the item, but you might overpay. If you are doing this on many items, you will presumably not overpay significantly all that often.
Too low: You probably won’t get the item, but if you do, it’s for a very low price. If you do this on many items, you have a high probability of getting some good deals, but also you will almost certainly not get many items.
Choosing a percentile of 90% is probably fairly safe.
No, you cannot “ensure”, only increase probability. If you set a maximum of $10,000 you will almost certainly win, but if you are crazy enough to bid that much, somebody else could be crazy enough to bid even more.
qnorm(p=0.9, mean=89, sd=15)
## [1] 108.2
SAT scores (out of 2400) are distributed normally with a mean of 1500 and a standard deviation of 300. Suppose a school council awards a certificate of excellence to all students who score at least 1900 on the SAT, and suppose we pick one of the recognized students at random. What is the probability this student’s score will be at least 2100?
pnorm(q=1900, mean=1500, sd=300)
## [1] 0.9088
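The call above gives the proportion of all students scoring below 1900, which is only one ingredient of the answer. Since the student is already known to have scored at least 1900, one way to finish is as a conditional probability. A sketch, with variable names of my own:

```r
mu <- 1500
sigma <- 300

p.1900 <- pnorm(1900, mean = mu, sd = sigma, lower.tail = FALSE)  # P(X >= 1900)
p.2100 <- pnorm(2100, mean = mu, sd = sigma, lower.tail = FALSE)  # P(X >= 2100)

# P(X >= 2100 | X >= 1900): every score above 2100 is also above 1900
cond <- p.2100 / p.1900
cond
```

This comes out at roughly one in four.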
Below are the final exam scores of 20 Introductory Statistics students.
stats.scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
The mean score is 77.7 with a standard deviation of 8.44 points. Use this info to determine whether scores approximately follow the 68-95-99.7% rule.
#68% of 20 tests is about 14 tests, so compare the 4th and 17th scores
#with the 16th and 84th percentiles (the bounds of the central 68%)
min.68 <- stats.scores[4]
max.68 <- stats.scores[17]
min.68
## [1] 71
qnorm(p=0.16, mean=77.7, sd=8.44)
## [1] 69.31
max.68
## [1] 83
qnorm(p=0.84, mean=77.7, sd=8.44)
## [1] 86.09
#the central 95% runs from the 2.5th to the 97.5th percentile
min.95 <- stats.scores[1]
max.95 <- stats.scores[20]
min.95
## [1] 57
qnorm(p=0.025, mean=77.7, sd=8.44)
## [1] 61.16
max.95
## [1] 94
qnorm(p=0.975, mean=77.7, sd=8.44)
## [1] 94.24
There are insufficient data points to test the 99.7% part of the rule. But at 68% and 95%, the data looks to conform fairly well. The lowest score (57) is not a terribly close match to the 2.5% quantile, but we can probably attribute this to small sample size.
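Another way to check the 68-95-99.7% rule, sketched here as an alternative rather than as the textbook's method, is to count directly what share of the scores falls within 1, 2 and 3 standard deviations of the mean. The name within.k is mine.

```r
stats.scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
                  79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
m <- mean(stats.scores)
s <- sd(stats.scores)

# Proportion of scores within k standard deviations of the mean
within.k <- sapply(1:3, function(k) mean(abs(stats.scores - m) <= k * s))
names(within.k) <- c("1 sd", "2 sd", "3 sd")
within.k
```

For these 20 scores the proportions are 0.70, 0.95 and 1.00, which is about as close to 68-95-99.7 as a sample this small can get.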
Below are heights of 25 college female students.
heights <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61, 62, 62, 63, 63, 63, 64, 65, 65, 67, 67, 69, 73)
The mean height is 61.52 inches with a standard deviation of 4.58 inches. Use this info to determine whether the heights approximately follow the 68-95-99.7% rule.
#68% of 25 is 17, so our min and max points are 5 and 21;
#compare them with the 16th and 84th percentiles
min.68 <- heights[5]
max.68 <- heights[21]
min.68
## [1] 57
qnorm(p=0.16, mean=61.52, sd=4.58)
## [1] 56.97
max.68
## [1] 65
qnorm(p=0.84, mean=61.52, sd=4.58)
## [1] 66.07
#closest approximation to 95% is first and last data points;
#compare them with the 2.5th and 97.5th percentiles
min.95 <- heights[1]
max.95 <- heights[25]
min.95
## [1] 54
qnorm(0.025, mean=61.52, sd=4.58)
## [1] 52.54
max.95
## [1] 73
qnorm(0.975, mean=61.52, sd=4.58)
## [1] 70.5
No reasonable approximation for 99.7%, but 68 and 95 are reasonably close.
3.19 Exercise 3.17 lists the final exam scores of 20 Introductory Statistics students. Do these data appear to follow a normal distribution?
hist(stats.scores)
qqnorm(stats.scores)
Yes, especially considering the small sample size, the data looks reasonably like a normal distribution.
3.20 Heights of female college students, part II
Exercise 3.18 lists the heights of 25 female college students. Do these data appear to follow a normal distribution?
hist(heights)
qqnorm(heights)
This data set appears to be right-skewed and so may not follow a normal distribution. More data would clarify.
Determine if each trial can be considered a Bernoulli trial for the following situations.
No. A poker hand is never a “success-fail” trial on its own.
Again no. There are six possible outcomes. If the trial is “a 6 or not a 6” then it would be a Bernoulli trial.
In the following situations assume that half the population is male and the other half is female.
0.5 and
4/9
## [1] 0.4444
0.5 and
4999/9999
## [1] 0.4999
Yes it is reasonable. In a large population, when presumably the sample size is a tiny percentage of the population, it makes very little difference whether the sampling is done with or without replacement.
The 2010 American Community Survey estimates that 47.1% of women ages 15 years and over are married.
pbinom(q=1, size=3, prob=0.471) - pbinom(q=0, size=3, prob=0.471)
## [1] 0.3954
#or
dbinom(x=1, size=3, prob=0.471)
## [1] 0.3954
dbinom(3, 3, 0.471)
## [1] 0.1045
p = 0.471
mu <- 1 / p
mu
## [1] 2.123
sd = sqrt((1-p)/p^2)
sd
## [1] 1.544
p <- 0.3
1 / p
## [1] 3.333
sqrt((1-p)/p^2)
## [1] 2.789
The mean goes up; it takes more trials to get a success. The standard deviation also goes up, because p (which is less than 1) goes down, and thus the numerator in the formula for sd goes up, while the denominator goes down.
A machine that produces a special type of transistor (a component of computers) has a 2% defective rate. The production is considered a random process where each transistor is independent of the others.
#x = # trials - # successes, size=# successes
dnbinom(x=9, size=1, prob=0.02)
## [1] 0.01667
#or, x=number of failures before success
dgeom(x=9, prob=0.02)
## [1] 0.01667
0.98^100
## [1] 0.1326
#or
dgeom(x=100, prob=0.02)/0.02 #the /0.02 undoes the last successful trial
## [1] 0.1326
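Reading 0.98^100 as the probability of 100 non-defective transistors in a row, there are more direct ways to get it from the distribution functions than dividing dgeom by 0.02. A sketch, with variable names of my own:

```r
p <- 0.02

# 100 non-defective transistors in a row, computed directly
direct <- (1 - p)^100

# The same event: more than 99 failures before the first "success" (a defect)
from.pgeom <- pgeom(q = 99, prob = p, lower.tail = FALSE)

# Or: zero defects in a binomial with 100 trials
from.dbinom <- dbinom(x = 0, size = 100, prob = p)

c(direct, from.pgeom, from.dbinom)
```

All three values should agree.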
p = 0.02
mu = 1/p
mu
## [1] 50
sd = sqrt((1-p)/p^2)
sd
## [1] 49.5
p = 0.05
mu = 1 / p
mu
## [1] 20
sd = sqrt((1-p)/p^2)
sd
## [1] 19.49
Both decrease
A husband and wife both have brown eyes but carry genes that make it possible for their children to have brown eyes (probability 0.75), blue eyes (0.125), or green eyes (0.125).
dgeom(x=2, prob=0.125)
## [1] 0.0957
p <- 0.125
mu <- 1/p
mu
## [1] 8
sd <- sqrt((1-p)/p^2)
sd
## [1] 7.483
Exercise 3.12 states that the distribution of speeds of cars traveling on the Interstate 5 Freeway (I-5) in California is nearly normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 miles/hour. The speed limit on this stretch of the I-5 is 70 miles/hour.
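z <- (72.6 - 70) / 4.78
p <- pnorm(z) # by symmetry, P(speed > 70): the probability that a single car is speeding
p^5 # probability that five independent cars are all speeding; (1 - p)^5 would be the probability that none are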
## [1] 0.1763
mu <- 1 / p
mu
## [1] 1.415
sd <- sqrt((1-p)/p^2)
sd
## [1] 0.7662
The Substance Abuse and Mental Health Services Administration estimated that 70% of 18-20 year olds consumed alcoholic beverages in 2008.
The conditions: 1) Fixed number of trials, 2) trials are independent, 3) each trial is success/failure, 4) equal probability of success in each trial.
Yes this appears to meet all of the conditions.
dbinom(x=6, size=10, prob=0.7)
## [1] 0.2001
dbinom(x=4, size=10, prob=0.3)
## [1] 0.2001
pbinom(q=2, size=5, prob=0.7)
## [1] 0.1631
pbinom(q=1, size=5, prob=0.7, lower.tail=FALSE)
## [1] 0.9692
The National Vaccine Information Center estimates that 90% of Americans have had chickenpox by the time they reach adulthood.
Yes, because 1) number of trials fixed, 2) trials are independent, 3) success/fail, 4) equal probability of success in each trial.
dbinom(x=97, size=100, prob=0.9)
## [1] 0.005892
dbinom(x=3, size=100, prob=0.1)
## [1] 0.005892
pbinom(q=1, size=10, prob=0.9, lower.tail=FALSE)
## [1] 1
pbinom(q=3, size=10, prob=0.1)
## [1] 0.9872
We learned in Exercise 3.27 that about 70% of 18-20 year olds consumed alcoholic beverages in 2008. We now consider a random sample of fifty 18-20 year olds.
p <- 0.7
n <- 50
mu <- n*p
mu
## [1] 35
sd <- sqrt(n*p*(1-p))
sd
## [1] 3.24
z <- abs(45-mu) / sd
if (z > 2) { # more than 2 standard deviations from the mean counts as unusual
print("Yes, the probability is low that 45 or more of 50 would have consumed alcohol:")
pnorm(z, lower.tail=FALSE)
} else {
print("No, the probability is reasonable that 45 or more out of 50 would have consumed alcohol:")
pnorm(z, lower.tail=FALSE)
}
## [1] "Yes, the probability is low that 45 or more of 50 would have consumed alcohol:"
## [1] 0.001014
pnorm(z, lower.tail=FALSE)
## [1] 0.001014
It’s essentially the same thing.
We learned in Exercise 3.28 that about 90% of American adults had chickenpox before adulthood. We now consider a random sample of 120 American adults.
n <- 120
p <- 0.9
mu <- n * p
mu
## [1] 108
sd <- sqrt(n*p*(1-p))
sd
## [1] 3.286
z <- abs(105 - mu)/sd
if (z > 2) { # more than 2 standard deviations from the mean counts as unusual
print("yes")
} else {
print("no")
}
## [1] "no"
pnorm(z, lower.tail=FALSE)
## [1] 0.1807
Suppose a university announced that it admitted 2,500 students for the following year’s freshman class. However, the university has dorm room spots for only 1,786 freshman students. If there is a 70% chance that an admitted student will decide to accept the offer and attend this university, what is the approximate probability that the university will not have enough dormitory room spots for the freshman class?
# Note: the dorms overflow when more than 1786 accept, so strictly q=1786; q=1787 as used here gives P(X >= 1788), which is slightly smaller
pbinom(q=1787, size=2500, prob=0.7, lower.tail=FALSE)
## [1] 0.05033
Pew Research reported in 2012 that the typical response rate to their surveys is only 9%. If for a particular survey 15,000 households are contacted, what is the probability that at least 1,500 will agree to respond?
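# Note: lower.tail=FALSE gives P(X > q), so this is P(X >= 1501); q=1499 would give exactly P(X >= 1500)
pbinom(q=1500, size=15000, prob=0.09, lower.tail=FALSE)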
## [1] 1.173e-05
A dreidel is a four-sided spinning top with the Hebrew letters nun, gimel, hei, and shin, one on each side. Each side is equally likely to come up in a single spin of the dreidel. Suppose you spin a dreidel three times. Calculate the probability of getting
pbinom(q=1, size=3, prob=0.25, lower.tail=FALSE)
## [1] 0.1562
dbinom(x=2, size=3, prob=0.25)
## [1] 0.1406
dbinom(x=1, size=3, prob=0.25)
## [1] 0.4219
pbinom(q=2, size=3, prob=0.25)
## [1] 0.9844
A 2005 Gallup Poll found that 7% of teenagers (ages 13 to 17) suffer from arachnophobia and are extremely afraid of spiders. At a summer camp there are 10 teenagers sleeping in each tent. Assume that these 10 teenagers are independent of each other.
pbinom(q=1, size=10, prob=0.07, lower.tail=FALSE)
## [1] 0.1517
dbinom(x=2, size=10, prob=0.07)
## [1] 0.1234
pbinom(q=1, size=10, prob=0.07)
## [1] 0.8483
p <- 0.07
n <- 10
mu <- n * p
mu
## [1] 0.7
sd <- sqrt(n*p*(1-p))
sd
## [1] 0.8068
Random assignment is expected to result in no more than 1 arachnophobe per tent, but the standard deviation is about 0.8, and it is not unreasonable to think that there will be times when 2 of them are in one tent. The counselors should ask the teens who is afraid and then assign manually.
Exercise 3.25 introduces a husband and wife with brown eyes who have 0.75 probability of having children with brown eyes, 0.125 probability of having children with blue eyes, and 0.125 probability of having children with green eyes.
0.125*(1-0.125)
## [1] 0.1094
#or
dbinom(x=1, size=1, prob=0.125) * dbinom(x=1, size=1, prob=0.875)
## [1] 0.1094
dbinom(x=1, size=2, prob=0.125)
## [1] 0.2187
dbinom(x=2, size=6, prob=0.125)
## [1] 0.1374
pbinom(q=1, size=6, prob=0.125, lower.tail=FALSE)
## [1] 0.1665
0.875^3 * 0.125^1
## [1] 0.08374
pr <- pbinom(q=3, size=6, prob=0.75, lower.tail=FALSE)
if (pr > 0.5) {
print(paste0("Yes ", pr))
} else {
print(paste0("No ", pr))
}
## [1] "Yes 0.83056640625"
Sickle cell anemia is a genetic blood disorder where red blood cells lose their flexibility and assume an abnormal, rigid, “sickle” shape, which results in a risk of various complications. If both parents are carriers of the disease, then a child has a 25% chance of having the disease, 50% chance of being a carrier, and 25% chance of neither having the disease nor being a carrier. If two parents who are carriers of the disease have 3 children, what is the probability that
dbinom(x=2, size=3, prob=0.25)
## [1] 0.1406
dbinom(x=0, size=3, prob=0.25)
## [1] 0.4219
pbinom(q=1, size=3, prob=0.5, lower.tail=FALSE)
## [1] 0.5
0.75^2 * 0.25^1
## [1] 0.1406
In the game of roulette, a wheel is spun and you place bets on where it will stop. One popular bet is that it will stop on a red slot; such a bet has an 18/38 chance of winning. If it stops on red, you double the money you bet. If not, you lose the money you bet. Suppose you play 3 times, each time with a $1 bet. Let Y represent the total amount won or lost. Write a probability model for Y.
p <- 18/38
n <- 3
win.0 <- dbinom(x=0, size=n, prob=p)
win.1 <- dbinom(x=1, size=n, prob=p)
win.2 <- dbinom(x=2, size=n, prob=p)
win.3 <- dbinom(x=3, size=n, prob=p)
print(paste0(win.0, " probability of losing $3"))
## [1] "0.145793847499636 probability of losing $3"
print(paste0(win.1, " probability of losing $1"))
## [1] "0.393643388249016 probability of losing $1"
print(paste0(win.2, " probability of winning $1"))
## [1] "0.354279049424114 probability of winning $1"
print(paste0(win.3, " probability of winning $3"))
## [1] "0.106283714827234 probability of winning $3"
overall <- (win.0 * -3) + (win.1 * -1) + (win.2 * 1) + (win.3 * 3)
print(paste0("Overall expected win/loss of ", overall))
## [1] "Overall expected win/loss of -0.157894736842106"
#or, start from losing every bet and add $2 for each expected win
mu <- n * p
overall <- (n * -1) + (mu * 2)
overall
## [1] -0.1579
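The model for Y can also be laid out as a small table. This is only a sketch: the names `wins`, `y` and `model` are mine, and it assumes each win nets +$1 and each loss nets -$1, as in the calculation above.

```r
p <- 18/38   #chance of red on one spin
n <- 3       #number of $1 bets

wins <- 0:n                              #possible number of wins
y <- 2 * wins - n                        #net result: +1 per win, -1 per loss
prob <- dbinom(x=wins, size=n, prob=p)   #binomial probability of each outcome

model <- data.frame(wins=wins, y=y, prob=prob)
model

#the probabilities should sum to 1
sum(model$prob)

#expected value of Y
expected <- sum(model$y * model$prob)
expected
```

The expected value matches the two calculations above, a loss of about 16 cents over three spins.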
In a multiple choice quiz there are 5 questions and 4 choices for each question (a, b, c, d). Robin has not studied for the quiz at all, and decides to randomly guess the answers. What is the probability that
0.75^2 * 0.25^1
## [1] 0.1406
sum(dbinom(x=3:4, size=5, prob=0.25))
## [1] 0.1025
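#note: lower.tail=FALSE gives P(X > q), so q=3 here means 4 or 5 correct;
#for a majority (3 or more out of 5) use q=2 instead
pbinom(q=3, size=5, prob=0.25, lower.tail=FALSE)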
## [1] 0.01562
The formula for the number of ways to arrange n objects is
\[n! = n \times (n-1) \times \cdots \times 2 \times 1\]
This exercise walks you through the derivation of this formula for a couple of special cases.
A small company has five employees: Anna, Ben, Carl, Damian, and Eddy. There are five parking spots in a row at the company, none of which are assigned, and each day the employees pull into a random parking spot. That is, all possible orderings of the cars in the row of spots are equally likely.
1/factorial(5)
## [1] 0.008333
factorial(5)
## [1] 120
factorial(8)
## [1] 40320
While it is often assumed that the probabilities of having a boy or a girl are the same, the actual probability of having a boy is slightly higher at 0.51. Suppose a couple plans to have 3 kids.
dbinom(x=2, size=3, prob=0.51)
## [1] 0.3823
#or
#B-B-G, B-G-B, G-B-B
(0.51*0.51*0.49) + (0.51*0.49*0.51) + (0.49*0.51*0.51)
## [1] 0.3823
Confirm that your answers from parts (a) and (b) match.
They match. The second method gets tedious quickly, though: for 8 kids with 3 boys you would have to write out and add up \({8 \choose 3} = 56\) scenarios.
Calculate the following probabilities and indicate which probability distribution model is appropriate in each case. You roll a fair die 5 times. What is the probability of rolling
(5/6)^4 * (1/6)^1
## [1] 0.08038
#or
dbinom(x=1, size=5, prob=1/6) / choose(5, 1)
## [1] 0.08038
#or
dnbinom(x=4, size=1, prob=1/6)
## [1] 0.08038
#the above says, return the probability of four failures before the first success;
#with size=1 the single success is always the last trial
#or
dgeom(x=4, prob=1/6)
## [1] 0.08038
#the above says, probability of four failures followed by a success
dbinom(x=3, size=5, prob=1/6)
## [1] 0.03215
#first get the probability of two 6s on the first four rolls
#then the probability of a 6 on a single roll
#and multiply the two together
dbinom(x=2, size=4, prob=1/6) * dbinom(x=1, size=1, prob=1/6)
## [1] 0.01929
#or
dnbinom(x=2, size=3, prob=1/6)
## [1] 0.01929
#the above says, return the probability of two failures before the third success;
#the first four rolls can come in any order, but the fifth roll is always a 6
Calculate the following probabilities and indicate which probability distribution model is appropriate in each case. A very good darts player can hit the bullseye (red circle in the center of the dart board) 65% of the time. What is the probability that he
dbinom(x=9, size=14, prob=0.65) * dbinom(x=1, size=1, prob=0.65)
## [1] 0.1416
#or
dnbinom(x=5, size=10, prob=0.65)
## [1] 0.1416
dbinom(x=10, size=15, prob=0.65)
## [1] 0.2123
#or
choose(15, 10) * 0.65^10 * 0.35^5
## [1] 0.2123
dgeom(x=2, prob=0.65)
## [1] 0.07962
#or
0.35^2 * 0.65^1
## [1] 0.07962
For a sociology class project you are asked to conduct a survey on 20 students at your school. You decide to stand outside of your dorm’s cafeteria and conduct the survey on a random sample of 20 students leaving the cafeteria after dinner one evening. Your dorm is comprised of 45% males and 55% females.
The negative binomial is the appropriate model: it gives the probability of a given number of failures occurring before a specified number of successes. The earlier trials can come in any order, but the last trial is always a success.
dnbinom(x=2, size=2, prob=0.55)
## [1] 0.1838
#or
0.55^2 * 0.45^2 * choose(3, 1)
## [1] 0.1838
#the fourth trial is always female, so we use choose on the first three trials
#to arrive at the number of combinations
\[\{M,M,F,F\},\{M,F,M,F\},\{F,M,M,F\}\]
One common feature among these scenarios is that the last trial is always female. In the first three trials there are 2 males and 1 female. Use the binomial coefficient to confirm that there are 3 ways of ordering 2 males and 1 female.
n <- 4
k <- 2
choose(n-1, k-1)
## [1] 3
#or
factorial(n-1)/(factorial(k-1) * factorial(n-k))
## [1] 3
The last trial in the negative binomial is always a success, so it does not vary and we do not need to consider it when counting the possible orderings.
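As a quick check of that reasoning, the binomial coefficient on the first n-1 trials can be combined with the success and failure probabilities and compared to dnbinom. This is a sketch using the survey numbers from above (n=4 trials, k=2 females, 0.55 probability of a female); the names `by.hand` and `built.in` are mine.

```r
n <- 4      #trial on which the k-th success occurs
k <- 2      #number of successes
p <- 0.55   #probability of success on each trial

#orderings of the first n-1 trials, times the probability of any one ordering
by.hand <- choose(n-1, k-1) * p^k * (1-p)^(n-k)

#dnbinom takes the number of failures (n-k) and the number of successes (k)
built.in <- dnbinom(x=n-k, size=k, prob=p)

by.hand
built.in
all.equal(by.hand, built.in)
```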
A not-so-skilled volleyball player has a 15% chance of making the serve, which involves hitting the ball so it passes over the net on a trajectory such that it will land in the opposing team’s court. Suppose that her serves are independent of each other.
What is the probability that on the 10th try she will make her 3rd successful serve?
Suppose she has made two successful serves in nine attempts. What is the probability that her 10th serve will be successful?
Even though parts (a) and (b) discuss the same scenario, the probabilities you calculated should be different. Can you explain the reason for this discrepancy?
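No worked answer is recorded here for the volleyball exercise, so this is a minimal sketch of how parts (a) and (b) could be computed, following the same negative binomial pattern as the darts exercise. The variable names are mine.

```r
p <- 0.15   #chance of a successful serve

#(a) third success on the 10th try: 7 failures before the 3rd success
part.a <- dnbinom(x=7, size=3, prob=p)
#or: 2 successes in the first 9 serves, then a success on the 10th
part.a.alt <- dbinom(x=2, size=9, prob=p) * p

#(b) serves are independent, so the first nine attempts do not matter
part.b <- p

part.a
part.a.alt
part.b
```

Part (a) comes out near 0.039, while part (b) is simply 0.15. The two differ because (a) is the probability of the whole ten-serve sequence, whereas in (b) the first nine serves have already happened and only the last serve is uncertain.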
A coffee shop serves an average of 75 customers per hour during the morning rush.
Which distribution we have studied is most appropriate for calculating the probability of a given number of customers arriving within one hour during this time of day?
What are the mean and the standard deviation of the number of customers this coffee shop serves in one hour during this time of day?
Would it be considered unusually low if only 60 customers showed up to this coffee shop in one hour during this time of day?
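A count of arrivals in a fixed time window suggests the Poisson distribution, whose mean is the rate and whose standard deviation is the square root of the rate. The sketch below works through the three questions on that basis. The "more than 2 standard deviations from the mean" cutoff for unusual is a rule of thumb I am assuming, and the variable names are mine.

```r
lambda <- 75            #average customers per hour

mu <- lambda            #Poisson mean
sigma <- sqrt(lambda)   #Poisson standard deviation, about 8.66
mu
sigma

#how many standard deviations below the mean is 60?
z <- (60 - mu) / sigma
z

#rule of thumb (an assumption): unusual if more than 2 SDs from the mean
unusually.low <- abs(z) > 2
unusually.low

#for reference, the exact probability of 60 or fewer customers
ppois(q=60, lambda=lambda)
```

Sixty customers is about 1.73 standard deviations below the mean, so by the 2 SD rule it would not be considered unusually low.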
A very skilled court stenographer makes one typographical error (typo) per hour on average.
What probability distribution is most appropriate for calculating the probability of a given number of typos this stenographer makes in an hour?
What are the mean and the standard deviation of the number of typos this stenographer makes?
Would it be considered unusual if this stenographer made 4 typos in a given hour?
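Typos per hour are another count over a fixed interval, so the Poisson distribution is a natural choice here, with the rate of 1 per hour as its parameter. This is a sketch under that assumption; the 2 standard deviation cutoff for unusual is a rule of thumb, and the variable names are mine.

```r
lambda <- 1             #average typos per hour

mu <- lambda            #Poisson mean
sigma <- sqrt(lambda)   #Poisson standard deviation
mu
sigma

#how many standard deviations above the mean is 4 typos?
z <- (4 - mu) / sigma
z

#probability of 4 or more typos in an hour: P(X > 3)
p.4.or.more <- ppois(q=3, lambda=lambda, lower.tail=FALSE)
p.4.or.more
```

Four typos is 3 standard deviations above the mean, and the chance of 4 or more in an hour is a little under 2%, so it would be considered unusual.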
Exercise 3.45 gives the average number of customers visiting a particular coffee shop during the morning rush hour as 75. Calculate the probability that this coffee shop serves 70 customers in one hour during this time of day.
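The probability of an exact count is a job for the "d" function. A minimal sketch, assuming the Poisson model with a rate of 75; the by-hand line uses the standard Poisson formula as a cross-check, and the names are mine.

```r
lambda <- 75   #average customers per hour

#probability of exactly 70 customers
p.70 <- dpois(x=70, lambda=lambda)
p.70

#or, from the Poisson formula: lambda^k * exp(-lambda) / k!
p.70.by.hand <- lambda^70 * exp(-lambda) / factorial(70)
p.70.by.hand

all.equal(p.70, p.70.by.hand)
```

The result is roughly 0.04.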
Exercise 3.46 gives the average number of typos of a very skilled court stenographer as 1 per hour. Calculate the probability that this stenographer makes at most 2 typos in a given hour.
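"At most 2" is a cumulative probability, so the "p" function applies. A minimal sketch, assuming the Poisson model with a rate of 1 typo per hour; the names are mine.

```r
lambda <- 1   #average typos per hour

#probability of 0, 1 or 2 typos: P(X <= 2)
p.at.most.2 <- ppois(q=2, lambda=lambda)
p.at.most.2

#or, add up the individual probabilities
p.at.most.2.sum <- sum(dpois(x=0:2, lambda=lambda))
p.at.most.2.sum
```

Both come out at about 0.92.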