Categorical Data Analysis Homework 2

#Question 1

BD <- 302
BI <- 80
WD <- 871
WI <- 444

totalB <- BD + BI
totalW <- WD + WI

totalD <- BD + WD
totalI <- BI + WI

#part a, point estimates for difference and RR and odds ratio
#pi_1 = P(D|B), pi_2 = P(D|W)

pi_1 <- BD/totalB
pi_2 <- WD/totalW

diff <- pi_1 - pi_2
RR <- pi_1/pi_2
odds_ratio <- (BD*WI)/(BI*WD)

#OR
odds_2 <- (pi_1/(1-pi_1))/((pi_2/(1-pi_2)))

Differece: 0.1282185
Relative Risk: 1.193579
Odds Ratio: 1.9243398
The point estimate for the difference is .1282. This means that the difference between the probability of a randomly chosen person who identifies as black being a democrat and the probability of a randomly chosen person who identifies as white being a democrat is .1282.

The point estimate for the relative risk is 1.1936. This means that the probability of being a democrat if you identify as black is estimated to be 1.1936 times the probability of being a democrat if you identify as white.

The point estimate for the odds ratio is 1.9243. This means that the odds of being a democrat if you identify as black is estimated to be 1.9243 times the odds of being a democrat if you identify as white.

#part c

#difference CI

CI_diff <- diff + c(-1,1)*qnorm(0.975)*sqrt(((pi_1*(1-pi_1))/totalD)+((pi_2*(1-pi_2))/totalI))

#RR CI

log <- log(pi_1/pi_2)

SE_log <- sqrt((1-pi_1)/BD + (1-pi_2)/WD)

CI_RR_log <- log + c(-1,1)*qnorm(0.975)*SE_log
CI_RR <- exp(CI_RR_log)

#odds ratio CI

log_odds <- log(odds_ratio)

SE_log_odds <- sqrt((1/BD)+(1/BI)+(1/WD)+(1/WI))
CI_log_odds <- log_odds + c(-1,1)*qnorm(0.975)*SE_log_odds
CI_odds <- exp(CI_log_odds)

CI_diff

## [1] 0.08150959 0.17492742

CI_RR

## [1] 1.119086 1.273031

CI_odds

## [1] 1.466561 2.525012

The 95% confidence interval for the difference in proportions is 0.0815 to 0.1749. This means that with 95% confidence, the probability of being a democrat for those who identify as black is estimated to exceed the probability of being a democrat for those who identify as white by as little as 0.0815 and as much as 0.1749. Zero is not in this interval, and the whole interval is positive. Therefore, with 95% confidence, we think that the probability of being a democrat for those who identify as black to be greater than the probability of being a democrat for those who identify as white.

The 95% confidence interval for the relative risk is 1.1191 to 1.2730. This means that the probability of being a democrat if you identify as black is between 1.1191 and 1.2730 times the probability of being a democrat if you identify as white. 1 is not in this interval, so these results are significant.

The 95% confidence interval for the odds ratio is 1.4666 to 2.5250. This means that the odds of being a democrat if you identify as black are at least 1.4666 times and as many as 2.5250 times the odds of being a democrat if you identify as white. One is not in this interval, so we can conclude that the odds of being a democrat for those who identify as black are higher than the odds of being a democrat for those who identify as white.

#part e
#let's make a MATRIX 

data <- matrix(c(BD, BI, WD, WI), nrow = 2, ncol = 2, byrow = TRUE, dimnames = list(c("Black","White"), c("Democrat", "Independent")))
counts <- c(BD, BI, WD, WI)
overall_total <- sum(counts)

expected <- c((totalB*totalD)/overall_total, (totalB*totalI)/overall_total,(totalW*totalD)/overall_total,(totalW*totalI)/overall_total)

chisq <- sum((counts-expected)^2/expected)
p_chisquared <- 1-pchisq(chisq,1)

test <- chisq.test(data) 

#likelihood

g_squared <- 2*( BD*log(BD/expected[1]) + BI*log(BI/expected[2]) + 871*log(871/expected[3]) + 
              444*log(444/expected[4]))

p_likelihood <- 1-pchisq(g_squared,1)

test$p.value

## [1] 2.452072e-06

p_likelihood

## [1] 9.406615e-07

Both the Pearson and Likelihood tests of independence, the null hypothesis is that there is no association between race and political party, or that race and political party are independent. The alternative hypothesis is that race and political party are not independent.

For the Pearson test, the test statistic is 22.204 with 1 degree of freedom. The p-value is 2.452072110^{-6}. At the 95% significance level, we reject the null hypothesis. There is sufficient evidence to suggest that race and political party are not independent.

For the likelihood test, the test statistic is 24.0459039 with 1 degree of freedom. The p-value is 9.406615210^{-7}. At the 95% significance level, we reject the null hypothesis. There is sufficient evidence to suggest that race and political party are not independent.

The results from the confidence intervals and results from the tests of independence agree in their conclusion. The confidence intervals showed significant results and lead us to believe that there is a difference between the proportion of democrats who identify as black and who identify as white and that the relative risk and odds ratios are not one. The tests for indpendence show an association between race and political party. If there were no association, we would expect that there would not be a signficant difference in proportions and we would expect the relative risk and odds ratios to both be one, which is not the case.

stand_res <- test$stdres
print(stand_res)

##       Democrat Independent
## Black  4.77498    -4.77498
## White -4.77498     4.77498

The standardized residuals in every cell have an absolute value greater than 2. This indicates that the null hypothesis, that political party and race are independent, is not a good fit in any of the cells. The residual value for those who identify as black and are democrat is large and positive. This means that there is a higher count in this cell than the null hypothesis predicts. This is also true for those who identify as white and who are independent. The residual value for those who identify as white and a democrat is large and negative. The count in this cell is smaller than the null hypothesis predicts. This is the case for those who identify as black and who are independent as well.

Question 2

#question 2
#let's make a matrix

table <- matrix(c(21, 2, 15, 3), nrow = 2, ncol = 2, byrow = TRUE, dimnames = list(c("Surgery", "Radiation Therapy"), c("Cancer Controlled", "Cancer Not Controlled")))
fisher.test(table, alternative = "greater")

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table
## p-value = 0.3808
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.2864828       Inf
## sample estimates:
## odds ratio 
##   2.061731

table

##                   Cancer Controlled Cancer Not Controlled
## Surgery                          21                     2
## Radiation Therapy                15                     3

p_obs <-(choose(23,21)*choose(18,15))/choose(41,36) + (choose(23, 22)*(choose(18, 14)/choose(41,36)) + choose(23,23)*choose(18,13)/choose(41,36))

The p-value associated with the fisher’s exact test is 0.3808337. At alpha equal to 0.05, we fail to reject the null hypothesis. There is no sufficient evidence to suggest an association between treatment and whether the cancer is controlled.

Question 3

This scenario is an example of Simpson’s paradox. This is a situation where a statistical trend disappears or reverses when looking at partitioned data versus all the data as a whole. In this specific case, it’s possible for the overall death rate to be higher in Maine than in South Carolina even if the death rate is higher in every age group in South Carolina because Maine could have a higher population of people in the age group where the death rate is highest. For example, South Carolina could have a death rate for people over 65 that is very high but not a large population of people over 65. Main’s death rate could be lower for this age group but it could have a large number of people in this group that brings the overall death rate down.
Batting average is calculated by dividing the total number of a batter’s safe hits by their number of times at bat. In this case, it is possible for Smith to have a higher batting average for each year individually but for Jones to have a higher overall batting average. This is illustrated in the data below. Smith’s batting average is higher in each individual year, but Jones has more “at bats” overall and has a higher average overall.

year05 <- matrix(c(120,300,100,300), nrow = 2, ncol=2,byrow = TRUE, dimnames = list(c("Smith" , "Jones"), c("Hits", "At Bats")) )
year06 <- matrix(c(8,10, 70,100), nrow = 2, ncol=2,byrow = TRUE, dimnames = list(c("Smith" , "Jones"), c("Hits", "At Bat")) )

avg05Smith <- year05[1]/year05[3]
avg05Jones <- year05[2]/year05[4]

avg06Smith <- year06[1]/year06[3]
avg06Jones <- year06[2]/year06[4]

overallSmith <- (year05[1]+year06[1])/(year05[3]+year06[3])
overallJones <- (year05[2]+year06[2])/(year05[4]+year06[4])

Jones 2005 and 2006: 0.3333333 and 0.7
Smith 2005 and 2006: 0.4 and 0.8
Overall Jones: 0.425
Overall Smith: 0.4129032

Question 4

#question 4
prev <- 0.08
sensitivity <- .95
specificity <- .20

PPV <- (prev*sensitivity)/((prev*sensitivity)+((1-prev)*specificity))

Given these parameters, the probability of someone actually having the disease is 0.2923077.

Question 5

#question 5
#individual test positive twice
#binomial dist with p success PPV

probability <- dbinom(2,2,PPV)

#OR by hand
n=2
k=2
prob_again <- choose(n,k)*PPV^k*(1-PPV)^(n-k)

Question 5
a. The probability that an individual tests positive twice is 0.0854438.
b. Let \(PPV\) = positive predicted value, \(k\) = number of times and individual tests positive, and \(n\) = number of times individual is tested. \(P = {n\choose k}*PPV^k * (1-PPV)^{n-k}\)

Question 6

clinicone <- matrix(c(18,12,12,8), byrow = FALSE, nrow = 2, dimnames = list(c("A","B"), c("Success", "Failure")))
clinictwo <- matrix(c(2,8,8,32), byrow = FALSE, nrow = 2, dimnames = list(c("A","B"), c("Success", "Failure")))

#calculate conditional associations, show they equal 1
#clinic one
total <- ((18*8)/(12*12))

#clinic two
totaltwo <- ((2*32)/(8*8))

#marginal association
marg <- ((clinicone[1,1]+clinictwo[1,1])*(clinicone[2,2]+clinictwo[2,2]))/((clinicone[2,1]+clinictwo[2,1])*(clinicone[1,2]+clinictwo[1,2]))

#sample odds of success 100% higher for drug treatment A than B

As shown above, the conditional odds ratios between X and Y both equal one. This means that X and Y are independent in each partial table. On the other hand, the marginal odds ratio is two. This means that the odds of success for treatment A is twice the odds of success for treatment B when you ignore the clinic. When we control for the clinic (i.e. the conditional odds ratios) this relationship disappears.

Categorical Data Analysis Homework 2

Holly Michalak

January 29, 2019