Homework 2

Question 1

(Agresti, 2.2) For diagnostic testing, let X = true status (1 = disease, 2 = no disease) and Y = diagnosis (1 = positive, 2 = negative). Let π1 = P(Y = 1|X = 1) and π2 = P(Y = 1|X = 2). Let γ denote the probability that a subject has the disease.

(a) Given that the diagnosis is positive, use Bayes’ Theorem to show that the probability a subject truly has the disease is P(X = 1 | Y = 1) = π1γ/(π1γ + π2(1 − γ)).

The probability of getting a positive diagnosis, P(Y=1), is:

P(Y=1) = Probability of testing positive while having the disease + Probability of testing positive while not having the disease

\[ P(Y=1) = P(Y=1 ∩ X=1) + P(Y=1 ∩ X=2) \]

\[ = P(Y=1|X=1).P(X=1) + P(Y=1|X=2).P(X=2) \]

Now, according to Bayes’ Theorem,

\[ P(A|B)=\frac{P(B | A).P(A)}{P(B)} \]

Therefore,

\[ P(X=1|Y=1)=\frac{P(Y=1|X=1).P(X=1)}{P(Y=1)} \]

\[ P(X=1|Y=1)=\frac{P(Y=1|X=1).P(X=1)}{P(Y=1|X=1).P(X=1) + P(Y=1|X=2).P(X=2)} \]

Now, substituting the values from the question, π1 = P(Y = 1|X = 1), π2 = P(Y = 1|X = 2), γ = P(X = 1), and 1 − γ = P(X = 2):

\[ P(X=1|Y=1)=\frac{π1.γ}{π1.γ + π2.(1-γ)} \]

(b) For mammograms for detecting breast cancer, suppose γ = 0.01, sensitivity = π1 =0.86, and specificity = 1 − π2 = 0.88. Find the positive predictive value.

Here,

\[ P(X=1|Y=1)=\frac{π1.γ}{π1.γ + π2.(1-γ)} \]

\[ P(X=1|Y=1)=\frac{0.86*0.01}{0.86*0.01 + (1-0.88)*(1-0.01)}=0.0675 \]

Therefore, given a positive mammogram, the probability that a woman truly has breast cancer (the positive predictive value) is 0.0675.

(c) To better understand the answer in (b), find the joint probabilities for the 2 × 2 cross classification of X and Y . Discuss their relative sizes in the two cells that refer to a positive test result.

Here,

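gamma <- 0.01        # prevalence, P(X = 1)
pi_1 <- 0.86         # sensitivity, P(Y = 1 | X = 1)
pi_2 <- 1 - 0.88     # 1 - specificity, P(Y = 1 | X = 2)

# Joint probabilities P(X = x ∩ Y = y)
p_x1_y1 <- pi_1 * gamma
p_x1_y2 <- gamma - p_x1_y1
p_x2_y1 <- pi_2 * (1 - gamma)
p_x2_y2 <- (1 - gamma) - p_x2_y1

# Marginal probabilities of a positive and of a negative test
pos_total <- p_x1_y1 + p_x2_y1
neg_total <- p_x2_y2 + p_x1_y2

p_x1_y1

## [1] 0.0086

p_x1_y2

## [1] 0.0014

p_x2_y1

## [1] 0.1188

p_x2_y2

## [1] 0.8712

pos_total

## [1] 0.1274

neg_total

## [1] 0.8726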

	Y=1	Y=2	Total
X=1	0.0086	0.0014	0.01
X=2	0.1188	0.8712	0.99
Total	0.1274	0.8726	1

Cells containing positive test results are:

P(X=1 ∩ Y=1) = 0.0086
P(X=2 ∩ Y=1) = 0.1188

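Therefore, the probability that a woman does not have cancer but tests positive (0.1188) is about 14 times the probability that she has cancer and tests positive (0.0086). Because the disease is rare, false positives make up most of the positive results, which is why the positive predictive value in (b) is so low.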

Question 2

For adults who sailed on the Titanic on its fateful voyage, the odds ratio between gender (female, male) and survival (yes, no) was 11.4.

(a) What is wrong with the interpretation, “The probability of survival for females was 11.4 times that for males?” Give the correct interpretation.

The interpretation is wrong because it treats the odds ratio as a ratio of probabilities (a relative risk). The odds ratio is the ratio of the odds of an event occurring in one group to the odds of the same event occurring in another group. Odds are not probabilities, and the two ratios are close only when the event is rare in both groups, which is not the case for survival on the Titanic.

The correct interpretation is: for the adults who sailed on the Titanic, the odds of survival for females were 11.4 times the odds of survival for males.

(b) The odds of survival for females equaled 2.9. For each gender, find the proportion who survived. Find the value of RR in the interpretation, “The probability of survival for females was RR times that for males.”

odds_r = 11.4

odds_F = 2.9
prop_F =odds_F/(odds_F+1)

odds_M = (odds_F)/odds_r
prop_M=odds_M/(odds_M+1)

RR=prop_F/prop_M

prop_F

## [1] 0.7435897

prop_M

## [1] 0.2027972

RR

## [1] 3.666667

The proportion who survived the Titanic voyage was 0.744 for females and 0.203 for males. The relative risk, RR, is 3.667. Therefore, the probability of survival for females was 3.667 times that for males.
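The conversion used in the code above can be written out by hand. As a brief sketch, it relies only on the definition odds = p/(1 − p), which rearranges to p = odds/(1 + odds); the symbols p_F and p_M are introduced here for the female and male survival proportions.

\[ p_F=\frac{2.9}{1+2.9} \approx 0.744, \qquad \text{odds}_M=\frac{2.9}{11.4} \approx 0.254, \qquad p_M=\frac{0.254}{1+0.254} \approx 0.203 \]

\[ RR=\frac{p_F}{p_M} \approx \frac{0.744}{0.203} \approx 3.67 \]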

Question 3

Table 2.11, below, cross-classifies votes in the 2008 and 2012 US Presidential elections. Estimate and find a 95% confidence interval for the population odds ratio. Explain in words what your confidence interval tells you.

library(epitools)
oddsratio(c(802,53,34,494), method="wald", conf=0.95,correct=FALSE)

## $data
##           Outcome
## Predictor  Disease1 Disease2 Total
##   Exposed1      802       53   855
##   Exposed2       34      494   528
##   Total         836      547  1383
## 
## $measure
##           odds ratio with 95% C.I.
## Predictor  estimate    lower    upper
##   Exposed1   1.0000       NA       NA
##   Exposed2 219.8602 140.8909 343.0917
## 
## $p.value
##           two-sided
## Predictor  midp.exact  fisher.exact    chi.square
##   Exposed1         NA            NA            NA
##   Exposed2          0 1.673741e-263 1.327956e-228
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

The estimated odds ratio is 219.86, and the 95% Wald confidence interval for the population odds ratio is (140.89, 343.09). We estimate that the odds of voting for Obama in 2012 were 219.86 times as high for people who voted for Obama in 2008 as for people who voted for the other candidate in 2008. We can say with 95% confidence that the population odds ratio lies between 140.89 and 343.09. Because the interval lies far above 1, there is very strong evidence that the 2008 and 2012 votes are associated.
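The Wald interval can also be checked by hand. The sketch below uses only base R and assumes the usual large-sample formula, in which the standard error of the log odds ratio is the square root of the sum of the reciprocal cell counts. The variable names (n11, or_hat, and so on) are chosen here for illustration and are not part of epitools.

```r
# Cell counts, in the same row-by-row order as in the oddsratio() call.
n11 <- 802; n12 <- 53
n21 <- 34;  n22 <- 494

# Sample odds ratio (cross-product ratio) and its logarithm.
or_hat <- (n11 * n22) / (n12 * n21)
log_or <- log(or_hat)

# Large-sample standard error of the log odds ratio.
se_log_or <- sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

# 95% Wald interval on the log scale, back-transformed to the odds-ratio scale.
z <- qnorm(0.975)
ci <- exp(log_or + c(-1, 1) * z * se_log_or)

or_hat
ci
```

Up to rounding, this should agree with the estimate and interval reported by oddsratio() above.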

Question 4

The odds ratio can be defined as θ = (P(Y = 1|X = 1)/P(Y = 2|X = 1)) / (P(Y = 1|X = 2)/P(Y = 2|X = 2)).

In case-control studies we are not able to estimate P(Y = y|X = x) because the number of subjects that have each outcome level y is fixed by design. Instead we are able to estimate P(X = x|Y = y). Show mathematically why this enables us to estimate odds ratios from case-control studies, i.e. show that the odds ratio can be written in terms of things we can estimate.
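One possible argument, sketched here rather than offered as the definitive solution, applies Bayes’ Theorem to each conditional probability in the definition of θ:

\[ P(Y=y|X=x)=\frac{P(X=x|Y=y) \cdot P(Y=y)}{P(X=x)} \]

In the odds for a fixed level x, the denominator P(X=x) cancels:

\[ \frac{P(Y=1|X=x)}{P(Y=2|X=x)}=\frac{P(X=x|Y=1) \cdot P(Y=1)}{P(X=x|Y=2) \cdot P(Y=2)} \]

Taking the ratio of these odds for x = 1 and x = 2, the factor P(Y=1)/P(Y=2) cancels as well:

\[ \theta=\frac{P(X=1|Y=1)/P(X=1|Y=2)}{P(X=2|Y=1)/P(X=2|Y=2)}=\frac{P(X=1|Y=1)/P(X=2|Y=1)}{P(X=1|Y=2)/P(X=2|Y=2)} \]

Every term on the right-hand side has the form P(X = x|Y = y), which a case-control study can estimate, so θ can be estimated even though P(Y = y|X = x) cannot.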

Question 5

Suppose we took an updated survey to compare income classification to job satisfaction, but with slightly different categories and larger n.

Inc <- as.table(rbind(c(48, 40, 31), c(69, 72, 54), c(30, 51, 64)))
dimnames(Inc) <- list(Income = c("<25k", "25k-70k", ">75k"),
                      Job_Satisfaction = c("Little", "Moderate", "Very"))

Xsq <- chisq.test(Inc)
 
Xsq$observed

##          Job_Satisfaction
## Income    Little Moderate Very
##   <25k        48       40   31
##   25k-70k     69       72   54
##   >75k        30       51   64

(a) Provide a table with expected cell counts under the assumption that Income and Job Satisfaction are independent.

(The R calculation code is written above). The expected cell counts under the assumption that Income and Job Satisfaction are independent:

Xsq$expected

##          Job_Satisfaction
## Income      Little Moderate     Very
##   <25k    38.11111 42.25926 38.62963
##   25k-70k 62.45098 69.24837 63.30065
##   >75k    46.43791 51.49237 47.06972

(b) Using the Pearson χ2-test, test whether we have sufficient evidence to reject the null hypothesis that Income and Job Satisfaction are independent at the 0.05 level. For full credit, include all calculations and R code you used to perform this test.

Xsq

## 
##  Pearson's Chi-squared test
## 
## data:  Inc
## X-squared = 18.269, df = 4, p-value = 0.001093

#alternatively,
dfred= (3-1)*(3-1) #degree of freedom
qchisq(.95, df=dfred)

## [1] 9.487729

(The R calculation code is written above.) The p-value is 0.001093, which is well below the significance level of 0.05. Equivalently, the observed Pearson statistic X^2 = 18.269 exceeds the critical value 9.488 for df = 4. Therefore, at α = 0.05, we reject the null hypothesis that Income and Job Satisfaction are independent.
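Since the question asks for all calculations, the statistic can also be computed without chisq.test(). The following base-R sketch assumes the standard Pearson formula, the sum over cells of (observed − expected)^2 / expected. The names obs, expected, x2, dof, and p_value are introduced here for illustration.

```r
# Observed counts, entered row by row as in the table above.
obs <- rbind(c(48, 40, 31), c(69, 72, 54), c(30, 51, 64))

# Expected counts under independence: (row total * column total) / n.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
x2 <- sum((obs - expected)^2 / expected)

# Degrees of freedom and upper-tail p-value from the chi-squared distribution.
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

c(X2 = x2, df = dof, p = p_value)
```

These values should match the X-squared, df, and p-value printed by chisq.test() above, up to rounding.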

(c) Compute and report the standardized residuals for all the cells in the < 25k row. Explain what their values suggest about the relationship between Job Satisfaction and Income at this income level.

(The R code is written above).

Xsq$stdres

##          Job_Satisfaction
## Income        Little   Moderate       Very
##   <25k     2.2574473 -0.5028431 -1.7355437
##   25k-70k  1.3253790  0.5429387 -1.8755958
##   >75k    -3.5373725 -0.1033062  3.6304499

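For the <25k row, the standardized residuals are 2.26 (Little), -0.50 (Moderate), and -1.74 (Very). The residual for Little exceeds 2 in absolute value, so more low-income respondents report little job satisfaction than independence would predict. The residual for Very is negative but smaller than 2 in absolute value, so there is only weak evidence that fewer low-income respondents than expected are very satisfied. The Moderate cell is close to its expected count.

Across the full table, the negative residuals that stand out are for people with:
- Income level >75k who feel little job satisfaction (-3.54)
- Income level 25k-70k who feel very satisfied with their jobs (-1.88)
- Income level <25k who feel very satisfied with their jobs (-1.74)

Similarly, the large positive residuals are for people with:
- Income level <25k who feel little job satisfaction (2.26)
- Income level >75k who feel very satisfied with their jobs (3.63)

This suggests that income level and job satisfaction are dependent, and that job satisfaction tends to increase with income level.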
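The standardized residuals reported above can be reproduced by hand. This base-R sketch assumes the usual definition, (observed − expected) / sqrt(expected × (1 − row proportion) × (1 − column proportion)). The names obs, row_prop, col_prop, and std_res are introduced here for illustration.

```r
# Observed counts, entered row by row as in the table above.
obs <- rbind(c(48, 40, 31), c(69, 72, 54), c(30, 51, 64))
n <- sum(obs)

# Marginal proportions and expected counts under independence.
row_prop <- rowSums(obs) / n
col_prop <- colSums(obs) / n
expected <- outer(rowSums(obs), colSums(obs)) / n

# Standardized residual for each cell.
std_res <- (obs - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

# The first row corresponds to the <25k income level.
round(std_res[1, ], 3)
```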