Load some potentially useful packages:

library(ggplot2)

library(dplyr)


Question 1

Alcohol and nicotine consumption during pregnancy may harm children. One study classified 452 mothers according to their alcohol intake and their nicotine intake during pregnancy. The data are summarized in the following table:

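| Alcohol (ounces/day) | Nicotine (mg/day): None | Nicotine (mg/day): 1-15 | Nicotine (mg/day): 16 or more |
|:---------------------|:-----------------------:|:-----------------------:|:-----------------------------:|
| None                 | 105                     | 7                       | 11                            |
| 0.01-0.10            | 58                      | 5                       | 13                            |
| 0.11-0.99            | 84                      | 37                      | 42                            |
| 1.00 or more         | 57                      | 16                      | 17                            |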


(a) If one of these mothers is selected at random, what is the chance that they had some alcohol intake or some nicotine intake during their pregnancy?

1 - 105/452 = 347/452 ≈ 0.768

(b) If one of these mothers is selected at random, what is the chance that they took in at least 16 milligrams per day of nicotine and at least 1.00 ounce per day of alcohol?

17/452 ≈ 0.038


(c) If one of these mothers is selected at random, what is the chance that they had some nicotine intake during their pregnancy?

0.327


(d) If one of these mothers is selected at random, and you are told that they had some alcohol intake during their pregnancy, what is the chance that they had some nicotine intake during their pregnancy?

130/329 ≈ 0.395 (of the 329 mothers who had some alcohol intake, 130 also had some nicotine intake)


(e) If we let A = (had some nicotine intake), and B = (had some alcohol intake), then we can write the answer to (c) using the following notation:

P( A ) = fill in your answer from (c)

P(A) = 0.327

And, we can write the answer to (d) using the following notation:

P( A | B ) = fill in your answer from (d)

P(A | B) = 0.395

Note: The event on the right-hand side of the “|” (here, B) is the information that we “know”, or are “given”. P(A | B) is called a conditional probability, and we often say, “The probability of A given B.”
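The calculations in Question 1 can be checked with a short R sketch. The object names (`counts`, `n_both`, and so on) are illustrative choices, not part of the original activity; the counts are copied from the table above.

```r
# Counts from the Question 1 table: rows = alcohol level, columns = nicotine level
counts <- matrix(
  c(105,  7, 11,
     58,  5, 13,
     84, 37, 42,
     57, 16, 17),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    alcohol  = c("None", "0.01-0.10", "0.11-0.99", "1.00 or more"),
    nicotine = c("None", "1-15", "16 or more")
  )
)

n          <- sum(counts)            # all mothers in the study
n_nicotine <- sum(counts[, -1])      # some nicotine: drop the "None" column
n_alcohol  <- sum(counts[-1, ])      # some alcohol: drop the "None" row
n_both     <- sum(counts[-1, -1])    # some alcohol AND some nicotine

p_either    <- 1 - counts["None", "None"] / n   # (a) some alcohol OR some nicotine
p_A         <- n_nicotine / n                   # (c) P(A)
p_A_given_B <- n_both / n_alcohol               # (d) P(A | B)

round(c(p_either = p_either, p_A = p_A, p_A_given_B = p_A_given_B), 3)
```

Note that P(A | B) divides by the number of mothers with some alcohol intake, not by the full sample size; dividing `n_both` by `n` would give P(A and B) instead.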



Question 2

Suppose we select a person at random from the entire global population.

(a) For each of the following pairs of probabilities, which do you think is bigger?

i. P( lung cancer ) or P( lung cancer | smoker )

P( lung cancer | smoker ) is bigger: the chance of lung cancer among smokers is higher than the chance of lung cancer in the population as a whole


ii. P(likes McDonald's) or P(likes McDonald's | vegetarian)


iii. P( smart | Mac grad ) or P( Mac grad | smart )


Note: Part iii. emphasizes how different P(A | B) and P(B | A) can be! People often mix these 2 quantities up, and it is important to understand the distinction between them!



Question 3

Suppose that, in some population, the proportion of people who are HIV-positive is 0.0005. A blood test produces true positives (a positive test for an HIV-positive person) with probability 0.997, and false positives (a positive test for an HIV-negative person) with probability 0.001.

(a) Write 3 probability statements, one for each value quoted, using the notation (e.g. P(A), P(A|B)) introduced in Questions 1 and 2.


(b) Explain why it is perfectly possible that the quantities corresponding to 0.997 and to 0.001 do NOT add up to 1.


(c) Using the values provided, and assuming a hypothetical total population of 10,000 people, fill in the missing values in the “two-way” table below:

|                     | HIV+ | HIV- | Total |
|:--------------------|:----:|:----:|:-----:|
| Tests positive (T+) |      |      |       |
| Tests negative (T-) |      |      |       |
| Total               |      |      |       |


(d) If someone tests positive for HIV, what is the chance that they are actually HIV-positive?


(e) The answer you obtained in (d) might seem surprisingly low. Try to think of reasons why the value is lower than one might expect.



Question 4

The sensitivity of a screening test (e.g. a mammography or prostate cancer “PSA” test) is the probability of a positive test given that the individual has the condition. The specificity is the probability of a negative test given that the individual does not have the condition. The prevalence is the proportion of a population with the condition of interest. Suppose the sensitivity is 90%, the specificity is 80%, and the prevalence is 1%.

(a) Consider the following “two-way” table meant to summarize the information given above:

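|                     | Has Condition | Does not have condition | Total   |
|:--------------------|:-------------:|:-----------------------:|:-------:|
| Tests positive (T+) | a             | b                       | a+b     |
| Tests negative (T-) | c             | d                       | c+d     |
| Total               | a+c           | b+d                     | a+b+c+d |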

Assuming that there are 1000 people in total, find the values of a, b, c, and d.

a=9, b=198, c=1, d=792


(b) What is the probability of an individual having the condition if they have tested positive for it? Important Note: This quantity is called the positive predictive value (PPV).

4.3%


(c) What is the probability of an individual NOT having the condition if they have tested negative for it? Important Note: This quantity is called the negative predictive value (NPV).

792/793 ≈ 99.9%
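As a rough check on (a) through (c), the table and both predictive values can be computed from the three quoted quantities. The function `screening_table()` below is a hypothetical helper written for this sketch, not something supplied with the activity.

```r
# Hypothetical helper: expected two-way counts for a screening test,
# given its sensitivity, specificity, the prevalence, and a population size n.
screening_table <- function(sensitivity, specificity, prevalence, n) {
  with_condition    <- prevalence * n
  without_condition <- n - with_condition
  tp <- sensitivity * with_condition      # cell a: true positives
  fn <- with_condition - tp               # cell c: false negatives
  tn <- specificity * without_condition   # cell d: true negatives
  fp <- without_condition - tn            # cell b: false positives
  matrix(c(tp, fp, fn, tn), nrow = 2, byrow = TRUE,
         dimnames = list(test  = c("T+", "T-"),
                         truth = c("Has condition", "Does not have condition")))
}

tab <- screening_table(sensitivity = 0.90, specificity = 0.80,
                       prevalence = 0.01, n = 1000)

ppv <- tab["T+", "Has condition"] / sum(tab["T+", ])            # P(condition | T+)
npv <- tab["T-", "Does not have condition"] / sum(tab["T-", ])  # P(no condition | T-)

tab
round(c(ppv = ppv, npv = npv), 3)
```

PPV and NPV divide by row totals (everyone with a given test result), whereas sensitivity and specificity divide by column totals (everyone with a given true status).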


(d) Explain in your own words why sensitivity and PPV are different quantities, and why specificity and NPV are different quantities.

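the sensitivity is P(T+ | has condition): among people who have the condition, the chance of a positive test. The PPV is P(has condition | T+): among people who test positive, the chance that they actually have the condition. These condition on different groups, so they can be very different (here 90% versus 4.3%). In the same way, the specificity is P(T- | does not have condition), while the NPV is P(does not have condition | T-).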

NOTE: In some fields, people use the term “recall” instead of “sensitivity”, and they use the term “precision” instead of “PPV”.


(e) Sensitivity and specificity both describe correct decisions of a screening test. Why do you think we consider sensitivity and specificity separately? We’ll try to answer this big question by going through several smaller steps below:

Part 1: A natural way to define the “overall accuracy” of a test is as the overall percentage of the time that the test makes the correct decision. Find the overall accuracy for this test using the table you created in (a).

80.1%


Part 2: Now, consider a (silly/dumb) medical test which classifies everyone as disease-free. Re-create the table from (a) for this (silly/dumb) test.

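| Result of test | Has Condition | Does Not Have Condition | Total |
|:---------------|:-------------:|:-----------------------:|:-----:|
| Positive       | 0             | 0                       | 0     |
| Negative       | 10            | 990                     | 1000  |
| Total          | 10            | 990                     | 1000  |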


Part 3: What is the overall accuracy of the test proposed in Part 2? Compare it to the overall accuracy of the original test that you found in Part 1.

99%


Part 4: If we were using overall accuracy as our metric, would we select the original test from (a) or the (silly/dumb) test?

the (silly/dumb) test, because its overall accuracy (99%) is higher than that of the original test from (a) (80.1%)


Part 5: What is the specificity of the (silly/dumb) test? What is the sensitivity of the (silly/dumb) test? Compare these values to the specificity and sensitivity of the original test from (a).

specificity=100%, sensitivity=0%; the original test from (a) has specificity=80% and sensitivity=90%, so the (silly/dumb) test has better specificity but never detects anyone who has the condition


Part 6: Based on Part 5, would you prefer the original test from (a) or the (silly/dumb) test?

the original test from (a)


Return to the question posed at the start of (e): Why do you think we consider sensitivity and specificity separately? Answer this question in your own words.

because they measure two different kinds of correct decision: sensitivity measures how often someone who has the condition tests positive, and specificity measures how often someone who doesn’t have the condition tests negative. A test can do well on one and badly on the other (the silly/dumb test has 100% specificity but 0% sensitivity), and a single overall accuracy figure can hide this.
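The comparison in Parts 1 through 6 can be reproduced with a small sketch. `summarise_test()` is a hypothetical helper name, and the two matrices simply restate the counts from (a) and Part 2.

```r
# Two-way counts: rows = test result (T+, T-), columns = truth (has, does not have)
original <- matrix(c(9, 198,  1, 792), nrow = 2, byrow = TRUE)
silly    <- matrix(c(0,   0, 10, 990), nrow = 2, byrow = TRUE)

# Hypothetical helper returning the three summaries discussed in (e)
summarise_test <- function(tab) {
  c(accuracy    = (tab[1, 1] + tab[2, 2]) / sum(tab),  # correct decisions / everyone
    sensitivity = tab[1, 1] / sum(tab[, 1]),           # T+ among those with the condition
    specificity = tab[2, 2] / sum(tab[, 2]))           # T- among those without it
}

comparison <- rbind(original = summarise_test(original),
                    silly    = summarise_test(silly))
comparison
```

With a prevalence of only 1%, overall accuracy is dominated by the people who do not have the condition, which is why the test that never flags anyone scores higher on that single metric.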



Question 5

The Journal of the American Law and Economics Association published the results of a study of appeals of federal civil trials. The table below (extracted from the article) gives a breakdown of 2143 civil cases that were appealed by either the plaintiff or defendant. The outcome of the appeal, as well as the type of trial (judge or jury) was determined for each case.

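|                                | Jury | Judge | Total |
|:-------------------------------|:----:|:-----:|:-----:|
| Plaintiff trial win - reversed | 194  | 71    | 265   |
| Plaintiff trial win - affirmed | 429  | 240   | 669   |
| Defendant trial win - reversed | 111  | 68    | 179   |
| Defendant trial win - affirmed | 731  | 299   | 1030  |
| Total                          | 1465 | 678   | 2143  |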


(a) Suppose that one of the cases is selected at random, and you are told that it was a jury trial, what is the chance that the appeal reversed the original verdict?

20.8%


(b) Explain how your answer in (a) represents an updated, or conditional, probability (i.e., what other probability should it be compared to).

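this represents a conditional probability because it is computed using only the 1465 jury trials rather than all 2143 cases. It should be compared to the unconditional probability of a reversal, 444/2143 ≈ 20.7%.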


(c) Would you say that “reversal of verdict” and “type of trial” are independent? Note: We haven’t formally introduced the term “independent”, so try to think about what this might mean before answering the question.
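One way to explore (b) and (c) numerically is to compare the overall chance of a reversal with the chance of a reversal within each type of trial. This is only a sketch: the object names are illustrative, and it does not settle how close the values must be before the two variables are called independent.

```r
# Counts from the Question 5 table (without the Total row and column)
appeals <- matrix(
  c(194,  71,
    429, 240,
    111,  68,
    731, 299),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    outcome = c("Plaintiff win - reversed", "Plaintiff win - affirmed",
                "Defendant win - reversed", "Defendant win - affirmed"),
    trial   = c("Jury", "Judge")
  )
)

reversed <- c(1, 3)   # rows in which the appeal reversed the original verdict

p_reversed            <- sum(appeals[reversed, ]) / sum(appeals)          # P(reversed)
p_reversed_given_type <- colSums(appeals[reversed, ]) / colSums(appeals)  # P(reversed | trial type)

round(p_reversed, 4)
round(p_reversed_given_type, 4)
```

If knowing the type of trial changed the chance of a reversal substantially, the conditional values would differ noticeably from the overall value.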



Question 6

Instead of using a probability, an alternative way to describe the likelihood of an event occurring is with a quantity called an odds. In everyday language, people often use the terms “probability” and “odds” interchangeably, but they are different!


Definition: The odds that an event occurs is the chance that the event occurs divided by the chance that the event does NOT occur.

In other words, if p = chance that an event occurs, then:

\[ odds = \frac{p}{1-p} \]


(a) For each of the following, convert the given chance to its corresponding odds value:

Part 1: p = 0.1

(0.1)/(1-0.1)
## [1] 0.1111111

0.1111111

Part 2: p = 0.25

(0.25)/(1-0.25)
## [1] 0.3333333

0.33333333

Part 3: p = 0.5

(0.5)/(1-0.5)
## [1] 1

1

Part 4: p = 0.75

(0.75)/(1-0.75)
## [1] 3

3

Part 5: p = 0.90

(0.90)/(1-0.90)
## [1] 9

9

Part 6: p = 0.99

(0.99)/(1-0.99)
## [1] 99

99


(b) If we know the odds that an event will occur, we can figure out the probability (p) that it will occur via the following relationship:

\[ p = \frac{odds}{odds + 1} \]


Find the chance of an event occurring if the odds of it occurring are:

Part 1: odds = 0.1

(0.1)/(0.1+1)
## [1] 0.09090909

0.09090909

Part 2: odds = 1

(1)/(1+1)
## [1] 0.5

0.5

Part 3: odds = 4

(4)/(1+4)
## [1] 0.8

0.8
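The conversions in (a) and (b) can also be wrapped in two small functions. The names `prob_to_odds()` and `odds_to_prob()` are hypothetical; the formulas are the two given in this question.

```r
# Hypothetical helper names; the formulas come from the definitions in Question 6
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (odds + 1)

p <- c(0.1, 0.25, 0.5, 0.75, 0.90, 0.99)

odds_from_p <- prob_to_odds(p)               # part (a), all six values at once
p_from_odds <- odds_to_prob(c(0.1, 1, 4))    # part (b)

odds_from_p
p_from_odds

# Converting to odds and back should recover the original probabilities
all.equal(odds_to_prob(odds_from_p), p)
```

Because both functions are vectorised, a whole set of probabilities or odds can be converted in one call.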


(c) What are the lower and upper bounds on the possible values of:

Part 1: a probability, p:

0 to 1

Part 2: an odds:

0 to infinity (no upper bound): the odds are 0 when p = 0 and grow without limit as p approaches 1

Part 3: the log of an odds:

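negative infinity to infinity (no lower or upper bound): the log of the odds goes to -Inf as the odds approach 0 and to Inf as the odds grow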

Note: You may not remember much about the log function, but we will talk about it more in Topic 7. For now, you can try uncommenting and running the following calculations to help you:

log( 1 )
## [1] 0
log( 0 )
## [1] -Inf
log( 10^300 )
## [1] 690.7755
log( 10^400 )
## [1] Inf
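To connect these calculations to the bounds asked about in (c), a minimal sketch is to evaluate the odds and the log of the odds at the extreme probabilities. It relies on R evaluating `1 / 0` as `Inf`.

```r
p <- c(0, 0.5, 1)        # smallest, middle, and largest possible probability

odds     <- p / (1 - p)  # 1 / 0 evaluates to Inf in R
log_odds <- log(odds)    # log(0) is -Inf and log(Inf) is Inf

odds
log_odds
```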


(d) Consider the following made-up two-way table, with a response variable (YES or NO) displayed in the columns, and a predictor variable (HIGH or LOW) displayed in the rows:


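|                   | Response Variable = YES | Response Variable = NO | Total |
|:------------------|:-----------------------:|:----------------------:|:-----:|
| Predictor is HIGH | 22                      | 12                     | 34    |
| Predictor is LOW  | 16                      | 14                     | 30    |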


Find the chance that the response variable is a YES when the predictor is HIGH, AND, when the predictor is LOW.

Predictor is HIGH: 22/34 ≈ 0.6471; Predictor is LOW: 16/30 ≈ 0.5333

(e) How much more likely is a YES when the predictor is HIGH compared to when the predictor is LOW?

about 0.1137 (22/34 - 16/30), i.e., a YES is about 11 percentage points more likely when the predictor is HIGH


(f) Now, find the odds that the response variable is a YES when the predictor is HIGH, AND, when the predictor is LOW.

Predictor is HIGH: 22/12 ≈ 1.833; Predictor is LOW: 16/14 ≈ 1.143

(g) How many times larger is the odds of a YES when the predictor is HIGH compared to when the predictor is LOW?

(22/12)/(16/14) ≈ 1.60

Note 1: The quantity you found in (g) is called an odds ratio.

Note 2: As we’ll see in Topic 8, when the response is a YES-NO variable, it’s more common (and better!) to estimate the effect of a predictor using an odds ratio instead of how it was done in (e).
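Parts (d) through (g) can be reproduced from the made-up table with a short sketch. The object names are illustrative; the difference in (e) is computed as a difference in probabilities, which is one reading of "how much more likely".

```r
# Counts from the made-up two-way table in (d)
tab <- matrix(c(22, 12,
                16, 14),
              nrow = 2, byrow = TRUE,
              dimnames = list(predictor = c("HIGH", "LOW"),
                              response  = c("YES", "NO")))

p_yes    <- tab[, "YES"] / rowSums(tab)   # (d) chance of a YES in each row
odds_yes <- tab[, "YES"] / tab[, "NO"]    # (f) odds of a YES in each row

prob_difference <- p_yes[["HIGH"]] - p_yes[["LOW"]]        # (e)
odds_ratio      <- odds_yes[["HIGH"]] / odds_yes[["LOW"]]  # (g)

round(p_yes, 4)
round(odds_yes, 3)
round(c(prob_difference = prob_difference, odds_ratio = odds_ratio), 3)
```

The odds in each row are YES counts divided by NO counts, which is the same as p / (1 - p) using that row's probability of a YES.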



Question 7

At an airport, each passenger must pass through a firearms detector. Suppose that, roughly, 1 in 10000 passengers is carrying a firearm. The detector doesn’t miss any firearms, but the chance of a false alarm is about 1 in 1000. If the alarm goes off, what is the chance that the passenger is carrying a firearm?



Question 8

Suppose the probability that a person has breast cancer is 0.8%. If the person has breast cancer, the probability is 90% that they will have a positive mammogram. If the person does not have breast cancer, the probability is 7% that they will still have a positive mammogram. Imagine a person who has a positive mammogram. What is the chance that they have breast cancer?

|                            | Positive Mammogram | Negative Mammogram | Either |
|:---------------------------|:------------------:|:------------------:|:------:|
| Positive for Breast Cancer | 0.0072             | 0.0008             | 0.0080 |
| Negative for Breast Cancer | 0.0694             | 0.9226             | 0.9920 |
| Either                     | 0.0766             | 0.9234             | 1.0000 |

0.0072/0.0766 ≈ 0.094 chance of having breast cancer with a positive mammogram
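A minimal sketch of the Question 8 calculation, using only the three probabilities quoted in the question, is shown here. The variable names are illustrative assumptions.

```r
prevalence            <- 0.008   # P(breast cancer)
p_pos_given_cancer    <- 0.90    # P(positive mammogram | breast cancer)
p_pos_given_no_cancer <- 0.07    # P(positive mammogram | no breast cancer)

# Joint probabilities: each row is P(status) times P(test result | status)
joint <- rbind(
  cancer    = prevalence       * c(p_pos_given_cancer,    1 - p_pos_given_cancer),
  no_cancer = (1 - prevalence) * c(p_pos_given_no_cancer, 1 - p_pos_given_no_cancer)
)
colnames(joint) <- c("positive", "negative")

# P(breast cancer | positive mammogram): cancer-and-positive over all positives
p_cancer_given_pos <- joint["cancer", "positive"] / sum(joint[, "positive"])

round(joint, 4)
round(p_cancer_given_pos, 3)
```

The four joint probabilities must sum to 1, which is a quick way to catch an arithmetic slip when filling in a table like this by hand.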

Question 9

Incoming students at a school take a math placement exam. The possible scores are: 1, 2, 3, 4. From past experience, the school knows that a student’s score on the exam is linked to the chance of eventually declaring a math major as shown below:


| Score on exam                     | 1    | 2    | 3    | 4    |
|:----------------------------------|:----:|:----:|:----:|:----:|
| Proportion who declare math major | 0.15 | 0.20 | 0.40 | 0.50 |


Moreover, the following table describes the scores obtained by the incoming class:

| Score on exam                      | 1    | 2    | 3    | 4    |
|:-----------------------------------|:----:|:----:|:----:|:----:|
| Proportion who obtained that score | 0.10 | 0.25 | 0.45 | 0.20 |


Explain why the values in the second table add up to 1, but the values in the first table do not add up to 1.