(Agresti, 2.2) For diagnostic testing, let X = true status (1 = disease, 2 = no disease) and Y = diagnosis (1 = positive, 2 = negative). Let π1 = P(Y = 1|X = 1) and π2 = P(Y = 1|X = 2). Let γ denote the probability that a subject has the disease.
(a) Given that the diagnosis is positive, use Bayes’ Theorem to show that the probability a subject truly has the disease is P(X = 1 | Y = 1) = π1γ/(π1γ + π2(1 − γ)).
The probability of getting a positive diagnosis, P(Y=1), is:
P(Y=1) = Probability of testing positive while having the disease + Probability of testing positive while not having the disease
\[ P(Y=1) = P(Y=1 \cap X=1) + P(Y=1 \cap X=2) \]
\[ = P(Y=1|X=1).P(X=1) + P(Y=1|X=2).P(X=2) \]
Now, according to Bayes’ Theorem,
\[ P(A|B)=\frac{P(B | A).P(A)}{P(B)} \]
Therefore,
\[ P(X=1|Y=1)=\frac{P(Y=1|X=1).P(X=1)}{P(Y=1)} \]
\[ P(X=1|Y=1)=\frac{P(Y=1|X=1).P(X=1)}{P(Y=1|X=1).P(X=1) + P(Y=1|X=2).P(X=2)} \]
Now, substituting the notation from the question, π1 = P(Y = 1|X = 1), π2 = P(Y = 1|X = 2), γ = P(X = 1), and 1 − γ = P(X = 2):
\[ P(X=1|Y=1)=\frac{π1.γ}{π1.γ + π2.(1-γ)} \]
(b) For mammograms for detecting breast cancer, suppose γ = 0.01, sensitivity = π1 =0.86, and specificity = 1 − π2 = 0.88. Find the positive predictive value.
Here,
\[ P(X=1|Y=1)=\frac{π1.γ}{π1.γ + π2.(1-γ)} \]
\[ P(X=1|Y=1)=\frac{0.86*0.01}{0.86*0.01 + (1-0.88)*(1-0.01)}=0.0675 \]
Therefore, the positive predictive value, i.e. the probability that a subject truly has the disease given a positive test, is 0.0675.
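The calculation in (b) can be wrapped in a small function to check the arithmetic and to see how the positive predictive value responds to prevalence. This is a minimal sketch: the function name `ppv` and the extra prevalence values 0.05 and 0.20 are illustrative assumptions, not part of the exercise.

```r
# Hypothetical helper: positive predictive value via Bayes' Theorem.
ppv <- function(sensitivity, specificity, prevalence) {
  true_pos  <- sensitivity * prevalence              # P(Y=1 and X=1)
  false_pos <- (1 - specificity) * (1 - prevalence)  # P(Y=1 and X=2)
  true_pos / (true_pos + false_pos)
}

# Values from part (b)
ppv_mammogram <- ppv(sensitivity = 0.86, specificity = 0.88, prevalence = 0.01)
round(ppv_mammogram, 4)

# Same test, illustrative (assumed) higher prevalences
ppv_by_prev <- sapply(c(0.01, 0.05, 0.20), function(p) ppv(0.86, 0.88, p))
round(ppv_by_prev, 3)
```

With sensitivity and specificity held fixed, the positive predictive value increases with prevalence, which is why it is so low for a condition with γ = 0.01.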
(c) To better understand the answer in (b), find the joint probabilities for the 2 × 2 cross classification of X and Y . Discuss their relative sizes in the two cells that refer to a positive test result.
Here,
gamma = 0.01      # prevalence, P(X=1)
pi_1 = 0.86       # sensitivity, P(Y=1|X=1)
pi_2 = (1-0.88)   # 1 - specificity, P(Y=1|X=2)
P_X1nY1 = pi_1*gamma
P_X1nY2 = gamma-P_X1nY1
P_X2nY1 = pi_2*(1-gamma)
P_X2nY2 = (1-gamma)-P_X2nY1
pos_total=P_X1nY1+P_X2nY1
neg_total=P_X2nY2+P_X1nY2
P_X1nY1
## [1] 0.0086
P_X1nY2
## [1] 0.0014
P_X2nY1
## [1] 0.1188
P_X2nY2
## [1] 0.8712
pos_total
## [1] 0.1274
neg_total
## [1] 0.8726
| | Y=1 | Y=2 | Total |
|---|---|---|---|
| X=1 | 0.0086 | 0.0014 | 0.01 |
| X=2 | 0.1188 | 0.8712 | 0.99 |
| Total | 0.1274 | 0.8726 | 1 |
Cells containing positive test results are:
P(X=1 ∩ Y=1) = 0.0086
P(X=2 ∩ Y=1) = 0.1188
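Therefore, the probability that a woman does not have cancer but tests positive (0.1188) is about 14 times the probability that she has cancer and tests positive (0.0086). Because the disease is rare, false positives make up most of the positive results, which is why the positive predictive value in (b) is so low.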
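The comparison of the two positive-test cells can also be read off a matrix version of the joint table. The sketch below assumes the same inputs as part (b); the object names (`joint`, `ratio_fp_tp`, `ppv_from_table`) are illustrative.

```r
gam  <- 0.01       # P(X = 1), prevalence
pi_1 <- 0.86       # P(Y = 1 | X = 1), sensitivity
pi_2 <- 1 - 0.88   # P(Y = 1 | X = 2), 1 - specificity

# Joint probabilities P(X = x and Y = y), filled row by row
joint <- matrix(
  c(pi_1 * gam,       (1 - pi_1) * gam,
    pi_2 * (1 - gam), (1 - pi_2) * (1 - gam)),
  nrow = 2, byrow = TRUE,
  dimnames = list(X = c("X=1", "X=2"), Y = c("Y=1", "Y=2"))
)

addmargins(joint)  # joint table with row and column totals

# False positives relative to true positives, and PPV from the Y=1 column
ratio_fp_tp    <- joint["X=2", "Y=1"] / joint["X=1", "Y=1"]
ppv_from_table <- joint["X=1", "Y=1"] / sum(joint[, "Y=1"])
round(c(ratio_fp_tp = ratio_fp_tp, ppv = ppv_from_table), 4)
```

Dividing the true-positive cell by the Y=1 column total recovers the positive predictive value from (b).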
-
For adults who sailed on the Titanic on its fateful voyage, the odds ratio between gender (female, male) and survival (yes, no) was 11.4.
(a) What is wrong with the interpretation, “The probability of survival for females was 11.4 times that for males?” Give the correct interpretation.
The interpretation is wrong because the odds ratio is a ratio of odds, not a ratio of probabilities. The odds of an event with probability p are p/(1 − p), so an odds ratio of 11.4 compares the odds of survival for females with the odds of survival for males. The odds ratio is close to the ratio of probabilities (the relative risk) only when both survival probabilities are small.
The correct interpretation is: for the adults who sailed on the Titanic, the odds of survival for females were 11.4 times the odds of survival for males.
(b) The odds of survival for females equaled 2.9. For each gender, find the proportion who survived. Find the value of RR in the interpretation, “The probability of survival for females was RR times that for males.”
odds_r = 11.4
odds_F = 2.9
prop_F =odds_F/(odds_F+1)
odds_M = (odds_F)/odds_r
prop_M=odds_M/(odds_M+1)
RR=prop_F/prop_M
prop_F
## [1] 0.7435897
prop_M
## [1] 0.2027972
RR
## [1] 3.666667
The proportion who survived the Titanic voyage was 0.744 for females and 0.203 for males. The relative risk, RR, is 3.667. Therefore, the probability of survival for females was 3.667 times that for males.
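As a check on part (b), the proportions can be converted back to odds to confirm that they reproduce the stated odds ratio of 11.4 while giving a much smaller relative risk. The helper names `odds_to_prob` and `prob_to_odds` are assumptions introduced for this sketch.

```r
# Hypothetical helpers for moving between odds and probabilities
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds <- function(p) p / (1 - p)

p_female <- odds_to_prob(2.9)         # odds of survival for females
p_male   <- odds_to_prob(2.9 / 11.4)  # male odds = female odds / odds ratio

odds_ratio    <- prob_to_odds(p_female) / prob_to_odds(p_male)
relative_risk <- p_female / p_male
c(odds_ratio = odds_ratio, relative_risk = relative_risk)
```

The two measures differ substantially here because neither survival probability is close to 0.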
-
Table 2.11, below, cross-classifies votes in the 2008 and 2012 US Presidential elections. Estimate and find a 95% confidence interval for the population odds ratio. Explain in words what your confidence interval tells you.
library(epitools)
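# Wald estimate and 95% CI; argument names spelled out instead of relying on partial matching
oddsratio(c(802,53,34,494), method="wald", conf.level=0.95, correction=FALSE)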
## $data
## Outcome
## Predictor Disease1 Disease2 Total
## Exposed1 802 53 855
## Exposed2 34 494 528
## Total 836 547 1383
##
## $measure
## odds ratio with 95% C.I.
## Predictor estimate lower upper
## Exposed1 1.0000 NA NA
## Exposed2 219.8602 140.8909 343.0917
##
## $p.value
## two-sided
## Predictor midp.exact fisher.exact chi.square
## Exposed1 NA NA NA
## Exposed2 0 1.673741e-263 1.327956e-228
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
The estimated odds ratio is 219.86, and the 95% Wald confidence interval for the population odds ratio is (140.89, 343.09). We estimate that the odds of voting for Obama in 2012 were about 219.86 times as high for people who voted for Obama in 2008 as for people who voted for another candidate in 2008. We can say with 95% confidence that the population odds ratio lies between 140.89 and 343.09. Because the whole interval is far above 1, there is a very strong positive association between the votes in the two elections.
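The Wald interval can also be computed by hand in base R from the four cell counts, using the standard error of the log odds ratio. This is a sketch for checking the `epitools` output; the object names are illustrative.

```r
# Cell counts entered row by row, in the same order as passed to oddsratio()
votes <- matrix(c(802, 53, 34, 494), nrow = 2, byrow = TRUE)

# Sample odds ratio (cross-product ratio)
or_hat <- (votes[1, 1] * votes[2, 2]) / (votes[1, 2] * votes[2, 1])

# Standard error of log(odds ratio): sqrt of the sum of reciprocal counts
se_log_or <- sqrt(sum(1 / votes))

# 95% Wald interval on the log scale, then exponentiated
z  <- qnorm(0.975)
ci <- exp(log(or_hat) + c(-1, 1) * z * se_log_or)
round(c(estimate = or_hat, lower = ci[1], upper = ci[2]), 2)
```

The interval is built on the log scale because the sampling distribution of the log odds ratio is closer to normal than that of the odds ratio itself.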
-
The odds ratio can be defined as
\[ θ = \frac{P(Y = 1|X = 1)/P(Y = 2|X = 1)}{P(Y = 1|X = 2)/P(Y = 2|X = 2)} \]
In case control studies we are not able to estimate P(Y = y|X = x) because the number of subjects that have each outcome level y is fixed by design. Instead we are able to estimate P(X = x|Y = y). Show mathematically why this enables us to estimate odds ratios from case-control studies, i.e. show that the odds ratio can be written in terms of things we can estimate.
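One way to show this is to apply Bayes' Theorem to each conditional probability in θ. The derivation below is a sketch of that argument, using only the quantities already defined in the problem.

```latex
\[ P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\,P(Y = y)}{P(X = x)} \]

\[
\begin{aligned}
\theta
&= \frac{P(Y=1 \mid X=1)\,/\,P(Y=2 \mid X=1)}{P(Y=1 \mid X=2)\,/\,P(Y=2 \mid X=2)} \\[4pt]
&= \frac{\dfrac{P(X=1 \mid Y=1)\,P(Y=1)}{P(X=1 \mid Y=2)\,P(Y=2)}}
        {\dfrac{P(X=2 \mid Y=1)\,P(Y=1)}{P(X=2 \mid Y=2)\,P(Y=2)}} \\[4pt]
&= \frac{P(X=1 \mid Y=1)\,P(X=2 \mid Y=2)}{P(X=1 \mid Y=2)\,P(X=2 \mid Y=1)} \\[4pt]
&= \frac{P(X=1 \mid Y=1)\,/\,P(X=2 \mid Y=1)}{P(X=1 \mid Y=2)\,/\,P(X=2 \mid Y=2)}
\end{aligned}
\]
```

In the second line the factors P(X = 1) and P(X = 2) cancel within each odds, and in the third line the ratio P(Y = 1)/P(Y = 2) cancels between numerator and denominator. The final expression involves only P(X = x|Y = y), which a case-control design can estimate, so the odds ratio is estimable even though P(Y = y|X = x) is not.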
Suppose we took an updated survey to compare income classification to job satisfaction, but with slightly different categories and larger n.
Inc <- as.table(rbind(c(48, 40, 31), c(69, 72, 54), c(30,51,64)))
dimnames(Inc) <- list(Income = c("<25k", "25k-70k", ">75k"),
Job_Satisfaction = c("Little","Moderate", "Very"))
Xsq <- chisq.test(Inc)
Xsq$observed
## Job_Satisfaction
## Income Little Moderate Very
## <25k 48 40 31
## 25k-70k 69 72 54
## >75k 30 51 64
(a) Provide a table with expected cell counts under the assumption that Income and Job Satisfaction are independent.
(The R calculation code is written above). The expected cell counts under the assumption that Income and Job Satisfaction are independent:
Xsq$expected
## Job_Satisfaction
## Income Little Moderate Very
## <25k 38.11111 42.25926 38.62963
## 25k-70k 62.45098 69.24837 63.30065
## >75k 46.43791 51.49237 47.06972
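The expected counts can also be computed directly from the margins, since under independence the expected count in each cell is (row total × column total) / n. The sketch below rebuilds the table so that it runs on its own; the object name `expected` is illustrative.

```r
Inc <- as.table(rbind(c(48, 40, 31), c(69, 72, 54), c(30, 51, 64)))
dimnames(Inc) <- list(Income = c("<25k", "25k-70k", ">75k"),
                      Job_Satisfaction = c("Little", "Moderate", "Very"))

n <- sum(Inc)  # total sample size

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(Inc), colSums(Inc)) / n
round(expected, 5)
```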
(b) Using the Pearson χ2-test, test whether we have sufficient evidence to reject the null hypothesis that Income and Job Satisfaction are independent at the 0.05 level. For full credit, include all calculations and R code you used to perform this test.
Xsq
##
## Pearson's Chi-squared test
##
## data: Inc
## X-squared = 18.269, df = 4, p-value = 0.001093
#alternatively,
dfred= (3-1)*(3-1) #degree of freedom
qchisq(.95, df=dfred)
## [1] 9.487729
(The R calculation code is written above). The observed p-value is 0.001093, which is much less than the significance level of 0.05. Or alternatively, the observed chi-squared X^2 statistic is 18.269 which is greater than the critical value 9.488 at df=4. Therefore, at α=0.05, we reject the null hypothesis that Income and Job Satisfaction are independent.
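For completeness, the Pearson statistic and its p-value can be computed step by step rather than through `chisq.test()`. This is a sketch of the hand calculation; the object names are illustrative.

```r
Inc <- as.table(rbind(c(48, 40, 31), c(69, 72, 54), c(30, 51, 64)))
dimnames(Inc) <- list(Income = c("<25k", "25k-70k", ">75k"),
                      Job_Satisfaction = c("Little", "Moderate", "Very"))

# Expected counts under independence
expected <- outer(rowSums(Inc), colSums(Inc)) / sum(Inc)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
X2 <- sum((Inc - expected)^2 / expected)

# Degrees of freedom and upper-tail p-value
df      <- (nrow(Inc) - 1) * (ncol(Inc) - 1)
p_value <- pchisq(X2, df = df, lower.tail = FALSE)
c(X2 = X2, df = df, p_value = p_value)
```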
(c) Compute and report the standardized residuals for all the cells in the < 25k row. Explain what their values suggest about the relationship between Job Satisfaction and Income at this income level.
(The R code is written above).
Xsq$stdres
## Job_Satisfaction
## Income Little Moderate Very
## <25k 2.2574473 -0.5028431 -1.7355437
## 25k-70k 1.3253790 0.5429387 -1.8755958
## >75k -3.5373725 -0.1033062 3.6304499
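Here, we can see negative residuals for people with:
- Income level >75k who feel little job satisfaction (−3.54)
- Income level <25k who feel very satisfied with their jobs (−1.74)
- Income level 25k-70k who feel very satisfied with their jobs (−1.88)
Similarly, we can see large positive residuals for people with:
- Income level <25k who feel little job satisfaction (2.26)
- Income level >75k who feel very satisfied with their jobs (3.63)
For the <25k row specifically, the residual of 2.26 for "Little" exceeds 2 in absolute value, so more low-income respondents report little job satisfaction than independence would predict. The residuals for "Moderate" (−0.50) and "Very" (−1.74) are below 2 in absolute value, although −1.74 points toward fewer very satisfied respondents than expected. Together with the other rows, this suggests that income level and job satisfaction are dependent, and that job satisfaction tends to increase with income level.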
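The standardized residuals reported by `chisq.test()` can be reproduced by hand as (observed − expected) / sqrt(expected × (1 − row proportion) × (1 − column proportion)). The sketch below assumes that formula and uses illustrative object names.

```r
Inc <- as.table(rbind(c(48, 40, 31), c(69, 72, 54), c(30, 51, 64)))
dimnames(Inc) <- list(Income = c("<25k", "25k-70k", ">75k"),
                      Job_Satisfaction = c("Little", "Moderate", "Very"))

n        <- sum(Inc)
row_prop <- rowSums(Inc) / n   # marginal proportions for Income
col_prop <- colSums(Inc) / n   # marginal proportions for Job_Satisfaction
expected <- outer(rowSums(Inc), colSums(Inc)) / n

# Standardized residual for each cell
stdres <- (Inc - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

round(stdres["<25k", ], 2)  # residuals for the <25k row
```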