#Question 1
BD <- 302
BI <- 80
WD <- 871
WI <- 444
totalB <- BD + BI
totalW <- WD + WI
totalD <- BD + WD
totalI <- BI + WI
#part a, point estimates for difference and RR and odds ratio
#pi_1 = P(D|B), pi_2 = P(D|W)
pi_1 <- BD/totalB
pi_2 <- WD/totalW
diff <- pi_1 - pi_2
RR <- pi_1/pi_2
odds_ratio <- (BD*WI)/(BI*WD)
#OR
odds_2 <- (pi_1/(1-pi_1))/((pi_2/(1-pi_2)))
Differece: 0.1282185
Relative Risk: 1.193579
Odds Ratio: 1.9243398
The point estimate for the difference is .1282. This means that the difference between the probability of a randomly chosen person who identifies as black being a democrat and the probability of a randomly chosen person who identifies as white being a democrat is .1282.
The point estimate for the relative risk is 1.1936. This means that the probability of being a democrat if you identify as black is estimated to be 1.1936 times the probability of being a democrat if you identify as white.
The point estimate for the odds ratio is 1.9243. This means that the odds of being a democrat if you identify as black is estimated to be 1.9243 times the odds of being a democrat if you identify as white.
#part c
#difference CI
CI_diff <- diff + c(-1,1)*qnorm(0.975)*sqrt(((pi_1*(1-pi_1))/totalD)+((pi_2*(1-pi_2))/totalI))
#RR CI
log <- log(pi_1/pi_2)
SE_log <- sqrt((1-pi_1)/BD + (1-pi_2)/WD)
CI_RR_log <- log + c(-1,1)*qnorm(0.975)*SE_log
CI_RR <- exp(CI_RR_log)
#odds ratio CI
log_odds <- log(odds_ratio)
SE_log_odds <- sqrt((1/BD)+(1/BI)+(1/WD)+(1/WI))
CI_log_odds <- log_odds + c(-1,1)*qnorm(0.975)*SE_log_odds
CI_odds <- exp(CI_log_odds)
CI_diff
## [1] 0.08150959 0.17492742
CI_RR
## [1] 1.119086 1.273031
CI_odds
## [1] 1.466561 2.525012
The 95% confidence interval for the relative risk is 1.1191 to 1.2730. This means that the probability of being a democrat if you identify as black is between 1.1191 and 1.2730 times the probability of being a democrat if you identify as white. 1 is not in this interval, so these results are significant.
The 95% confidence interval for the odds ratio is 1.4666 to 2.5250. This means that the odds of being a democrat if you identify as black are at least 1.4666 times and as many as 2.5250 times the odds of being a democrat if you identify as white. One is not in this interval, so we can conclude that the odds of being a democrat for those who identify as black are higher than the odds of being a democrat for those who identify as white.
#part e
#let's make a MATRIX
data <- matrix(c(BD, BI, WD, WI), nrow = 2, ncol = 2, byrow = TRUE, dimnames = list(c("Black","White"), c("Democrat", "Independent")))
counts <- c(BD, BI, WD, WI)
overall_total <- sum(counts)
expected <- c((totalB*totalD)/overall_total, (totalB*totalI)/overall_total,(totalW*totalD)/overall_total,(totalW*totalI)/overall_total)
chisq <- sum((counts-expected)^2/expected)
p_chisquared <- 1-pchisq(chisq,1)
test <- chisq.test(data)
#likelihood
g_squared <- 2*( BD*log(BD/expected[1]) + BI*log(BI/expected[2]) + 871*log(871/expected[3]) +
444*log(444/expected[4]))
p_likelihood <- 1-pchisq(g_squared,1)
test$p.value
## [1] 2.452072e-06
p_likelihood
## [1] 9.406615e-07
For the Pearson test, the test statistic is 22.204 with 1 degree of freedom. The p-value is 2.452072110^{-6}. At the 95% significance level, we reject the null hypothesis. There is sufficient evidence to suggest that race and political party are not independent.
For the likelihood test, the test statistic is 24.0459039 with 1 degree of freedom. The p-value is 9.406615210^{-7}. At the 95% significance level, we reject the null hypothesis. There is sufficient evidence to suggest that race and political party are not independent.
stand_res <- test$stdres
print(stand_res)
## Democrat Independent
## Black 4.77498 -4.77498
## White -4.77498 4.77498
Question 2
#question 2
#let's make a matrix
table <- matrix(c(21, 2, 15, 3), nrow = 2, ncol = 2, byrow = TRUE, dimnames = list(c("Surgery", "Radiation Therapy"), c("Cancer Controlled", "Cancer Not Controlled")))
fisher.test(table, alternative = "greater")
##
## Fisher's Exact Test for Count Data
##
## data: table
## p-value = 0.3808
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.2864828 Inf
## sample estimates:
## odds ratio
## 2.061731
table
## Cancer Controlled Cancer Not Controlled
## Surgery 21 2
## Radiation Therapy 15 3
p_obs <-(choose(23,21)*choose(18,15))/choose(41,36) + (choose(23, 22)*(choose(18, 14)/choose(41,36)) + choose(23,23)*choose(18,13)/choose(41,36))
The p-value associated with the fisher’s exact test is 0.3808337. At alpha equal to 0.05, we fail to reject the null hypothesis. There is no sufficient evidence to suggest an association between treatment and whether the cancer is controlled.
Question 3
This scenario is an example of Simpson’s paradox. This is a situation where a statistical trend disappears or reverses when looking at partitioned data versus all the data as a whole. In this specific case, it’s possible for the overall death rate to be higher in Maine than in South Carolina even if the death rate is higher in every age group in South Carolina because Maine could have a higher population of people in the age group where the death rate is highest. For example, South Carolina could have a death rate for people over 65 that is very high but not a large population of people over 65. Main’s death rate could be lower for this age group but it could have a large number of people in this group that brings the overall death rate down.
Batting average is calculated by dividing the total number of a batter’s safe hits by their number of times at bat. In this case, it is possible for Smith to have a higher batting average for each year individually but for Jones to have a higher overall batting average. This is illustrated in the data below. Smith’s batting average is higher in each individual year, but Jones has more “at bats” overall and has a higher average overall.
year05 <- matrix(c(120,300,100,300), nrow = 2, ncol=2,byrow = TRUE, dimnames = list(c("Smith" , "Jones"), c("Hits", "At Bats")) )
year06 <- matrix(c(8,10, 70,100), nrow = 2, ncol=2,byrow = TRUE, dimnames = list(c("Smith" , "Jones"), c("Hits", "At Bat")) )
avg05Smith <- year05[1]/year05[3]
avg05Jones <- year05[2]/year05[4]
avg06Smith <- year06[1]/year06[3]
avg06Jones <- year06[2]/year06[4]
overallSmith <- (year05[1]+year06[1])/(year05[3]+year06[3])
overallJones <- (year05[2]+year06[2])/(year05[4]+year06[4])
Jones 2005 and 2006: 0.3333333 and 0.7
Smith 2005 and 2006: 0.4 and 0.8
Overall Jones: 0.425
Overall Smith: 0.4129032
Question 4
#question 4
prev <- 0.08
sensitivity <- .95
specificity <- .20
PPV <- (prev*sensitivity)/((prev*sensitivity)+((1-prev)*specificity))
Given these parameters, the probability of someone actually having the disease is 0.2923077.
Question 5
#question 5
#individual test positive twice
#binomial dist with p success PPV
probability <- dbinom(2,2,PPV)
#OR by hand
n=2
k=2
prob_again <- choose(n,k)*PPV^k*(1-PPV)^(n-k)
Question 5
a. The probability that an individual tests positive twice is 0.0854438.
b. Let \(PPV\) = positive predicted value, \(k\) = number of times and individual tests positive, and \(n\) = number of times individual is tested. \(P = {n\choose k}*PPV^k * (1-PPV)^{n-k}\)
Question 6
clinicone <- matrix(c(18,12,12,8), byrow = FALSE, nrow = 2, dimnames = list(c("A","B"), c("Success", "Failure")))
clinictwo <- matrix(c(2,8,8,32), byrow = FALSE, nrow = 2, dimnames = list(c("A","B"), c("Success", "Failure")))
#calculate conditional associations, show they equal 1
#clinic one
total <- ((18*8)/(12*12))
#clinic two
totaltwo <- ((2*32)/(8*8))
#marginal association
marg <- ((clinicone[1,1]+clinictwo[1,1])*(clinicone[2,2]+clinictwo[2,2]))/((clinicone[2,1]+clinictwo[2,1])*(clinicone[1,2]+clinictwo[1,2]))
#sample odds of success 100% higher for drug treatment A than B
As shown above, the conditional odds ratios between X and Y both equal one. This means that X and Y are independent in each partial table. On the other hand, the marginal odds ratio is two. This means that the odds of success for treatment A is twice the odds of success for treatment B when you ignore the clinic. When we control for the clinic (i.e. the conditional odds ratios) this relationship disappears.