Load some potentially useful packages:
library(ggplot2)
library(dplyr)
Question 1
Alcohol and nicotine consumption during pregnancy may harm children.
One study classified 452 mothers according to their alcohol intake and
their nicotine intake during pregnancy. The data are summarized in the
following table:
| Alcohol (ounces/day) | Nicotine (mg/day): None | 1-15 | 16 or more |
|:--------------------:|:-----------------------:|:----:|:----------:|
| None | 105 | 7 | 11 |
| 0.01-0.10 | 58 | 5 | 13 |
| 0.11-0.99 | 84 | 37 | 42 |
| 1.00 or more | 57 | 16 | 17 |
(a) If one of these mothers is selected at random, what is the
chance that they had some alcohol intake or some nicotine intake during
their pregnancy?
0.768 (1 - 105/452 = 347/452)
(b) If one of these mothers is selected at random, what is the
chance that they took in at least 16 milligrams per day of nicotine and
at least 1.00 ounce per day of alcohol?
0.038 (17/452)
(c) If one of these mothers is selected at random, what is the
chance that they had some nicotine intake during their pregnancy?
0.327
(d) If one of these mothers is selected at random, and you are told
that they had some alcohol intake during their pregnancy, what is the
chance that they had some nicotine intake during their pregnancy?
0.3951 (130 of the 329 mothers with some alcohol intake also had some nicotine intake; 130/452 = 0.2876 is instead the chance of both)
(e) If we let A = (took in some nicotine), and B = (took in some
alcohol), then we can write the answer to (c) using the following
notation:
P( A ) = fill in your answer from (c)
P(A)=0.327
And, we can write the answer to (d) using the following
notation:
P( A | B ) = fill in your answer from (d)
P(A|B)=0.3951
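The counting behind parts (a) through (d) can be checked with a few lines of R. This is only a sketch: the object names (`counts`, `some_alcohol`, and so on) are made up for illustration, and it assumes the columns of the table are the nicotine categories in mg/day.

```r
# Counts from the Question 1 table: rows = alcohol, columns = nicotine.
counts <- matrix(
  c(105,  7, 11,
     58,  5, 13,
     84, 37, 42,
     57, 16, 17),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    alcohol  = c("None", "0.01-0.10", "0.11-0.99", "1.00 or more"),
    nicotine = c("None", "1-15", "16 or more")
  )
)
n <- sum(counts)  # 452 mothers in total

# (a) some alcohol OR some nicotine: everyone except the None/None cell
p_a <- 1 - counts["None", "None"] / n

# (b) heaviest category of both substances
p_b <- counts["1.00 or more", "16 or more"] / n

# (c) P(A): some nicotine = the two non-"None" nicotine columns
p_c <- sum(counts[, c("1-15", "16 or more")]) / n

# (d) P(A | B): restrict to mothers with some alcohol (drop the "None" row)
some_alcohol <- counts[rownames(counts) != "None", ]
p_d <- sum(some_alcohol[, c("1-15", "16 or more")]) / sum(some_alcohol)

round(c(a = p_a, b = p_b, c = p_c, d = p_d), 3)
```

The key difference between (c) and (d) is the denominator: (c) divides by all 452 mothers, while (d) divides only by the 329 mothers who had some alcohol intake.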
Question 2
Suppose we select a person at random from the entire global
population.
(a) For each of the following pairs of probabilities, which do you
think is bigger?
i. P( lung cancer ) or
P( lung cancer | smoker )
P( lung cancer | smoker ) is bigger, because smokers are more likely to
have lung cancer than a person selected at random from everyone
ii. P(likes McDonald's) or
P(likes McDonald's | vegetarian)
iii. P( smart | Mac grad ) or
P( Mac grad | smart )
Note: Part iii. emphasizes how different P(A | B)
and P(B | A) can be! People often mix these 2 quantities up, and it is
important to understand the distinction between them!
Question 3
Suppose that, in some population, the proportion of people who are
HIV-positive is 0.0005. A blood test produces true positives (a positive
test for an HIV-positive person) with probability 0.997, and false
positives (a positive test for an HIV-negative person) with probability
0.001.
(a) Write 3 probability statements, one for each value quoted, using
the notation (e.g. P(A), P(A|B)) introduced in Questions 1. and 2.
(b) Explain why it is perfectly possible that the quantities
corresponding to 0.997 and to 0.001 do NOT add up to 1.
(c) Using the values provided, and assuming a hypothetical total
population of 10,000 people, fill in the missing values in the “two-way”
table below:
| | HIV-positive | HIV-negative | Total |
|:---:|:---:|:---:|:---:|
| Tests positive (T+) | | | |
| Tests negative (T-) | | | |
| Total | | | |
(d) If someone tests positive for HIV, what is the chance that they
are actually HIV-positive?
(e) The answer you obtained in (d) might surprise you as being
unintuitively low. Try to think of reasons for why the value is lower
than one might expect.
Question 4
The sensitivity of a screening test (e.g. a
mammography or prostate cancer “PSA” test) is the probability of a
positive test given that the individual has the condition. The
specificity is the probability of a negative test given
that the individual does not have the condition. The
prevalence is the proportion of a population with the
condition of interest. Suppose the sensitivity is 90%, the specificity
is 80%, and the prevalence is 1%.
(a) Assuming that there are 1000 people in total, find the values of a,
b, c, and d.
a=9, b=198, c=1, d=792
(b) What is the probability of an individual having the condition if
they have tested positive for it? Important Note: This quantity
is called the positive predictive value (PPV).
4.3%
(c) What is the probability of an individual NOT having the
condition if they have tested negative for it? Important Note:
This quantity is called the negative predictive value
(NPV).
99.9% (792/793)
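As a rough check of (a) through (c), the four cell counts and the two predictive values can be derived directly from the sensitivity, specificity, and prevalence. The variable names below are illustrative, and the mapping of a, b, c, d to cells (a = true positives, b = false positives, c = false negatives, d = true negatives) is an assumption inferred from the answer given in (a).

```r
# Inputs quoted in the question
sensitivity <- 0.90   # P(test positive | condition)
specificity <- 0.80   # P(test negative | no condition)
prevalence  <- 0.01   # P(condition)
n_people    <- 1000

n_condition    <- prevalence * n_people    # people with the condition
n_no_condition <- n_people - n_condition   # people without it

true_pos  <- sensitivity * n_condition     # assumed to be cell a
false_neg <- n_condition - true_pos        # assumed to be cell c
true_neg  <- specificity * n_no_condition  # assumed to be cell d
false_pos <- n_no_condition - true_neg     # assumed to be cell b

ppv <- true_pos / (true_pos + false_pos)   # P(condition | positive test)
npv <- true_neg / (true_neg + false_neg)   # P(no condition | negative test)

round(c(a = true_pos, b = false_pos, c = false_neg, d = true_neg), 1)
round(c(PPV = ppv, NPV = npv), 3)
```

PPV divides by everyone who tests positive (a + b), and NPV divides by everyone who tests negative (c + d), which is why both depend on the prevalence as well as on the test itself.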
(d) Explain in your own words why sensitivity and
PPV are different quantities, and why
specificity and NPV are different
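quantities.
sensitivity is P(positive test | has condition): among people who have
the condition, how often the test is positive. PPV is P(has condition |
positive test): among people who test positive, how often they actually
have the condition. Likewise, specificity is P(negative test | no
condition), while NPV is P(no condition | negative test). Each pair
conditions on a different event, so the values can differ a lot (here,
sensitivity is 90% but PPV is only 4.3%).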
NOTE: In some fields, people use the term
“recall” instead of “sensitivity”, and
they use the term “precision” instead of
“PPV”.
(e) Sensitivity and specificity both describe correct decisions of a
screening test. Why do you think we consider sensitivity and specificity
separately? We’ll try to answer this big question by going
through several smaller steps below:
Part 1: A natural way to define the “overall
accuracy” of a test is as the overall percentage of the time
that the test makes the correct decision. Find the overall accuracy for
this test using the table you created in (a).
80.1%
Part 2: Now, consider a (silly/dumb) medical test which classifies
everyone as disease-free. Re-create the table from (a) for this
(silly/dumb) test.
| | Has condition | No condition | Total |
|:---:|:---:|:---:|:---:|
| Positive | 0 | 0 | 0 |
| Negative | 10 | 990 | 1000 |
| Total | 10 | 990 | 1000 |
Part 3: What is the overall accuracy of the test proposed in Part 2?
Compare it to the overall accuracy of the original test, that you found
in Part 1.
99%
Part 4: If we were using overall accuracy as our metric, would we
select the original test from (a) or the (silly/dumb) test?
the (silly/dumb) test, since its overall accuracy (99%) is higher than
the original test's (80.1%)
Part 5: What is the specificity of the (silly/dumb) test? What is
the sensitivity of the (silly/dumb) test? Compare these values to the
specificity and sensitivity of the original test from (a).
specificity=100%, sensitivity=0%, these are very different values
than the sensitivity and specificity for part (a)
Part 6: Based on Part 5, would you prefer the original test from (a)
or the (silly/dumb) test?
the original test from (a)
Return to the question posed at the start of (e): Why do you
think we consider sensitivity and specificity separately? Answer
this question in your own words.
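because they measure two different kinds of correct decision:
sensitivity measures how often people who have the condition test
positive, and specificity measures how often people who do not have the
condition test negative. Overall accuracy blends the two, so when the
condition is rare a test can score high on accuracy while missing every
true case, as the (silly/dumb) test does. Knowing one does not tell you
the other.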
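The comparison in Parts 1 through 6 can be sketched in R with a small helper. `test_metrics()` is a hypothetical function written for this example (it is not from any package), and it takes the four cell counts of a two-way table like the ones above.

```r
# Hypothetical helper: summary metrics from the four cells of a 2x2 table.
test_metrics <- function(tp, fp, fn, tn) {
  c(
    accuracy    = (tp + tn) / (tp + fp + fn + tn),  # all correct decisions
    sensitivity = tp / (tp + fn),                   # correct among those with the condition
    specificity = tn / (tn + fp)                    # correct among those without it
  )
}

# Original test from (a): 9 true positives, 198 false positives,
# 1 false negative, 792 true negatives.
original <- test_metrics(tp = 9, fp = 198, fn = 1, tn = 792)

# (silly/dumb) test: everyone is classified as disease-free.
silly <- test_metrics(tp = 0, fp = 0, fn = 10, tn = 990)

rbind(original, silly)
```

The (silly/dumb) test wins on accuracy (0.99 versus 0.801) only because 99% of people do not have the condition; its sensitivity of 0 shows it never detects a single case.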
Question 5
The Journal of the American Law and Economics Association published
the results of a study of appeals of federal civil trials. The table
below (extracted from the article) gives a breakdown of 2143 civil cases
that were appealed by either the plaintiff or defendant. The outcome of
the appeal, as well as the type of trial (judge or jury) was determined
for each case.
| | Jury | Judge | Total |
|:---:|:---:|:---:|:---:|
| Plaintiff trial win - reversed | 194 | 71 | 265 |
| Plaintiff trial win - affirmed | 429 | 240 | 669 |
| Defendant trial win - reversed | 111 | 68 | 179 |
| Defendant trial win - affirmed | 731 | 299 | 1030 |
| Total | 1465 | 678 | 2143 |
(a) Suppose that one of the cases is selected at random, and you are
told that it was a jury trial. What is the chance that the appeal
reversed the original verdict?
20.8%
(b) Explain how your answer in (a) represents an updated, or
conditional, probability (i.e., what other probability should it be
compared to).
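this represents a conditional probability because it is computed only
among the jury trials rather than among all 2143 cases. It should be
compared to the unconditional probability that an appeal reverses the
verdict, (265 + 179)/2143 = 20.7%.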
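A short R sketch of the comparison in (a) and (b) follows. It rests on an assumption: the first column of counts is taken to be the jury trials and the second the judge trials, which is consistent with the 20.8% answer in (a). The object names are illustrative.

```r
# Appeal counts; ASSUMPTION: first column = jury trials, second = judge trials.
appeals <- matrix(
  c(194,  71,
    429, 240,
    111,  68,
    731, 299),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    outcome = c("Plaintiff win - reversed", "Plaintiff win - affirmed",
                "Defendant win - reversed", "Defendant win - affirmed"),
    trial   = c("jury", "judge")
  )
)

# Rows in which the appeal reversed the original verdict
reversed <- grepl("reversed", rownames(appeals))

# Conditional: reversal rate among jury trials only
p_rev_given_jury <- sum(appeals[reversed, "jury"]) / sum(appeals[, "jury"])

# Unconditional: reversal rate among all appealed cases
p_rev <- sum(appeals[reversed, ]) / sum(appeals)

round(c(given_jury = p_rev_given_jury, overall = p_rev), 3)
```

The two values are nearly equal, which suggests that knowing the trial was a jury trial changes the chance of reversal very little.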
Question 6
Instead of using a probability, an alternative way
to describe the likelihood of an event occurring is with a quantity
called an odds. In everyday language, people often use
the terms “probability” and “odds” interchangeably, but they are
different!
Definition: The odds that an event occurs is the chance that
the event occurs divided by the chance that the event does NOT
occur.
In other words, if p = chance that an event occurs,
then:
\[ odds = \frac{p}{1-p} \]
(a) For each of the following, convert the given chance to its
corresponding odds value:
Part 1: p = 0.1
(0.1)/(1-0.1)
## [1] 0.1111111
0.1111111
Part 2: p = 0.25
(0.25)/(1-0.25)
## [1] 0.3333333
0.33333333
Part 3: p = 0.5
(0.5)/(1-0.5)
## [1] 1
1
Part 4: p = 0.75
(0.75)/(1-0.75)
## [1] 3
3
Part 5: p = 0.90
(0.90)/(1-0.90)
## [1] 9
9
Part 6: p = 0.99
(0.99)/(1-0.99)
## [1] 99
99
(b) If we know the odds that an event will occur, we can figure out
the probability (p) that it will occur via the following
relationship:
\[ p = \frac{odds}{odds + 1}
\]
Find the chance of an event occurring if the odds of it occurring
are:
Part 1: odds = 0.1
(0.1)/(0.1+1)
## [1] 0.09090909
0.09090909
Part 2: odds = 1
(1)/(1+1)
## [1] 0.5
0.5
Part 3: odds = 4
(4)/(1+4)
## [1] 0.8
0.8
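The two conversions used in (a) and (b) can be wrapped in small functions so that all of the parts are computed at once. The names `prob_to_odds()` and `odds_to_prob()` are made up for this sketch.

```r
# Hypothetical helpers implementing the two formulas above.
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (odds + 1)

# (a) Parts 1-6: probabilities to odds
p <- c(0.1, 0.25, 0.5, 0.75, 0.90, 0.99)
odds <- prob_to_odds(p)
round(odds, 4)

# (b) Parts 1-3: odds to probabilities
round(odds_to_prob(c(0.1, 1, 4)), 4)

# The two functions undo each other (up to floating-point error)
all.equal(odds_to_prob(odds), p)
```

Because both functions are vectorized, the same call works for a single value or for a whole vector of probabilities.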
(c) What are the lower and upper bounds on the possible values
of:
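Part 1: a probability, p:
lower bound 0, upper bound 1
Part 2: an odds:
lower bound 0, no upper bound (the odds grow without limit as p approaches 1)
Part 3: the log of an odds:
no lower or upper bound (from negative infinity to positive infinity)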
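One way to see these bounds is to evaluate the odds and the log odds at the extreme probabilities 0 and 1. This is a small illustrative sketch; it relies on R returning `Inf` for division by zero and `-Inf` for `log(0)`.

```r
# Extreme and middle probabilities
p <- c(0, 0.5, 1)

odds <- p / (1 - p)    # 0, 1, Inf: odds range from 0 up to infinity
log_odds <- log(odds)  # -Inf, 0, Inf: log odds are unbounded in both directions

data.frame(p = p, odds = odds, log_odds = log_odds)
```

A probability of 0.5 sits at odds of 1 and log odds of 0, so the log odds scale is symmetric around an even chance.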
(d) Consider the following made-up two-way table, with a response
variable (YES or NO) displayed in the columns, and a predictor variable
(HIGH or LOW) displayed in the rows:
| | YES | NO | Total |
|:---:|:---:|:---:|:---:|
| Predictor is HIGH | 22 | 12 | 34 |
| Predictor is LOW | 16 | 14 | 30 |
Find the chance that the response variable is a YES
when the predictor is HIGH, AND, when the predictor is LOW.
Predictor is High: 0.6471 Predictor is low: 0.53333
(e) How much more likely is a YES when the predictor is HIGH
compared to when the predictor is LOW?
about 0.1137
(f) Now, find the odds that the response variable
is a YES when the predictor is HIGH, AND, when the predictor is
LOW.
Predictor is High: 1.833 (22/12) Predictor is Low: 1.143 (16/14)
(g) How many times larger is the odds of a YES when the
predictor is HIGH compared to when the predictor is LOW?
1.60 (1.833/1.143)
Note 1: The quantity you found in (g) is called an
odds ratio.
Note 2: As we’ll see in Topic 8, when the response
is a YES-NO variable, it’s more common (and better!) to estimate the
effect of a predictor using an odds ratio instead of
how it was done in (e).
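Parts (d) through (g) can be reproduced from the two-way table in a few lines of R. The object names are illustrative, and the column labels YES and NO are taken from the description of the table in (d).

```r
# Made-up two-way table from (d): rows = predictor, columns = response.
tab <- matrix(
  c(22, 12,
    16, 14),
  nrow = 2, byrow = TRUE,
  dimnames = list(predictor = c("HIGH", "LOW"), response = c("YES", "NO"))
)

# (d) chance of YES within each predictor level
p_yes <- tab[, "YES"] / rowSums(tab)

# (e) difference in the chance of YES, HIGH minus LOW
risk_diff <- unname(p_yes["HIGH"] - p_yes["LOW"])

# (f) odds of YES within each predictor level: YES count over NO count
odds_yes <- tab[, "YES"] / tab[, "NO"]

# (g) odds ratio, HIGH relative to LOW
odds_ratio <- unname(odds_yes["HIGH"] / odds_yes["LOW"])

round(p_yes, 4)
round(odds_yes, 3)
round(c(risk_diff = risk_diff, odds_ratio = odds_ratio), 3)
```

The odds ratio can also be read straight off the table as the cross-product (22 × 14) / (12 × 16), which gives the same value.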
Question 7
At an airport, each passenger must pass through a firearms
detector. Suppose that, roughly, 1 in 10000 passengers is carrying a
firearm. The detector doesn’t miss any firearms, but the chance of a
false alarm is about 1 in 1000. If the alarm goes off, what is the
chance that the passenger is carrying a firearm?
Question 8
Suppose the probability that a person has breast cancer is 0.8%. If
the person has breast cancer, the probability is 90% that they will have
a positive mammogram. If the person does not have breast cancer, the
probability is 7% that they will still have a positive mammogram.
Imagine a person who has a positive mammogram. What is the chance that
they have breast cancer?
| | Positive Mammogram | Negative Mammogram | Either |
|:--------------------------:|:----------------:|:----------------:|:----:|
| Positive for Breast Cancer | 0.0072 | 0.0008 | 0.008 |
| Negative for Breast Cancer | 0.0694 | 0.9226 | 0.992 |
| Either | 0.0766 | 0.9234 | 1 |

0.0072/0.0766 = 0.094 chance of having breast cancer with a positive
mammogram
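The joint-probability table for this question can be built in R from the three numbers quoted in the problem. This is a sketch with made-up variable names; each cell is a prevalence multiplied by a conditional test probability.

```r
# Values quoted in the question
prev <- 0.008   # P(breast cancer)
sens <- 0.90    # P(positive mammogram | breast cancer)
fpr  <- 0.07    # P(positive mammogram | no breast cancer)

# Joint probabilities: rows = true status, columns = mammogram result
joint <- rbind(
  cancer    = c(positive = prev * sens,      negative = prev * (1 - sens)),
  no_cancer = c(positive = (1 - prev) * fpr, negative = (1 - prev) * (1 - fpr))
)

# Overall chance of a positive mammogram (column total)
p_positive <- sum(joint[, "positive"])

# P(breast cancer | positive mammogram)
p_cancer_given_positive <- joint["cancer", "positive"] / p_positive

round(joint, 4)
round(c(p_positive = p_positive, p_cancer_given_positive = p_cancer_given_positive), 4)
```

Most positive mammograms come from the much larger group without breast cancer, which is why the conditional probability stays below 10% even though the test detects 90% of true cases.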
Question 9
Incoming students at a school take a math placement exam. The
possible scores are: 1, 2, 3, 4. From past experience, the school knows
that a student’s score on the exam is linked to the chance of eventually
declaring a math major as shown below:
| Exam score | 1 | 2 | 3 | 4 |
|:---:|:---:|:---:|:---:|:---:|
| Proportion who declare math major | 0.15 | 0.20 | 0.40 | 0.50 |
Moreover, the following table describes the scores obtained by the
incoming class:
| Exam score | 1 | 2 | 3 | 4 |
|:---:|:---:|:---:|:---:|:---:|
| Proportion who obtained that score | 0.10 | 0.25 | 0.45 | 0.20 |
Explain why the values in the second table add up to 1, but the
values in the first table do not add up to 1.