library(survival)
## Loading required package: splines

Question 1

  1. The null hypothesis is that there is no difference in treatment, and the alternative hypothesis is that there is.
HIV=read.csv("HIV.csv")
survdiff(Surv(Time)~Drug, data=HIV)
## Call:
## survdiff(formula = Surv(Time) ~ Drug, data = HIV)
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## Drug=NoSaq 17       17     13.1      1.16      2.18
## Drug=Saq   17       17     20.9      0.73      2.18
## 
##  Chisq= 2.2  on 1 degrees of freedom, p= 0.14
1.16 + 0.73  #calculate test statistic
## [1] 1.89
1-pchisq(1.89, 1) #calculate p-value
## [1] 0.1692

The p-value is above .05, so we fail to reject the null hypothesis. There appears to be no difference in the distribution of times at which CD4 reaches the prescribed level between the two treatments.

  1. To find the point and interval estimate:
(17/13.1)/(17/20.9) #point estimate: HR = (OA/EA)/(OB/EB)
## [1] 1.595

The point estimate is larger than 1, indicating that patients taking the \(NoSaq\) treatment have a higher risk of having their CD4 count drop to the prespecified level.

To find the interval estimate:

1.59542*exp(-1.96*sqrt(1/13.1+1/20.9)) #lower bound for interval: [HR*e +/- 1.96*s ] where s = sd (log(HR)) = sqrt(1/EA + 1/EB)
## [1] 0.7997
1.59542*exp(1.96*sqrt(1/13.1+1/20.9)) #upper bound for interval
## [1] 3.183

Our interval estimate is [.7997, 3.183].

  1. Both tests have the same result: there appears to be no significant difference between treatment types. This is shown in (a) through a p-value greater than .05, and in part (b) through an interval estimate that contains 1 (indicating that the hazard ratios for the two groups could be the same).

Question 2

leaders=read.csv("Leaders.csv")
Status = (leaders$Lost > 0)+0
km=survfit(Surv(Years)~1, conf.type='plain', data=leaders)
plot(km, mark.time=F, conf.int=F)

plot of chunk unnamed-chunk-5

The maximum value of the plot appears to be 0.8 rather than 1, which is odd.

To plot the graph with the changed x-axis limits:

plot(km, mark.time=F, conf.int=F, xlim=c(-1,28))

plot of chunk unnamed-chunk-6

The maximum value of the plot is now 1, which is what we are used to seeing. Shifting the axis so that the \(y=0\) line is visible showed the real beginning of the plot, which was blocked by the immediate decrease in survival rate.

kmlead=survfit(Surv(Years, Status)~Manner, conf.type='plain', data=leaders)
plot(kmlead, mark.time=F, conf.int=F, col=1:2)

plot of chunk unnamed-chunk-7

There is some overlap between the distributions, but they each also have distinct values, leading us to believe that the distributions are not identical.

survdiff(Surv(Years, Status)~Manner, data=leaders)
## Call:
## survdiff(formula = Surv(Years, Status) ~ Manner, data = leaders)
## 
##            N Observed Expected (O-E)^2/E (O-E)^2/V
## Manner=0 295      223      237     0.879       3.1
## Manner=1 177      134      120     1.746       3.1
## 
##  Chisq= 3.1  on 1 degrees of freedom, p= 0.0785
.879+1.746 #calculate test statistic
## [1] 2.625
1-pchisq(2.625, 1) #calculate p-value
## [1] 0.1052

Our p-value is 0.105 (i.e. greater than .05), leading us to fail to reject the null hypothesis. There is not enough evidence to say that there is a difference in the distributions.

kmlead=survfit(Surv(Years, Status)~Manner, conf.type='plain', data=leaders)
summary(kmlead)
## Call: survfit(formula = Surv(Years, Status) ~ Manner, data = leaders, 
##     conf.type = "plain")
## 
##                 Manner=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    295      48   0.8373  0.0215       0.7952        0.879
##     1    244      25   0.7515  0.0252       0.7021        0.801
##     2    212      28   0.6522  0.0280       0.5973        0.707
##     3    177      18   0.5859  0.0292       0.5287        0.643
##     4    153      25   0.4902  0.0301       0.4313        0.549
##     5    123      25   0.3905  0.0298       0.3321        0.449
##     6     93      14   0.3318  0.0292       0.2746        0.389
##     7     74       4   0.3138  0.0290       0.2571        0.371
##     8     64       7   0.2795  0.0285       0.2235        0.335
##     9     53       1   0.2742  0.0285       0.2184        0.330
##    10     48       2   0.2628  0.0284       0.2071        0.319
##    11     44       5   0.2329  0.0282       0.1777        0.288
##    12     39       2   0.2210  0.0280       0.1662        0.276
##    13     34       2   0.2080  0.0278       0.1535        0.262
##    14     32       3   0.1885  0.0274       0.1349        0.242
##    15     29       2   0.1755  0.0270       0.1226        0.228
##    17     22       4   0.1436  0.0264       0.0919        0.195
##    18     17       2   0.1267  0.0258       0.0761        0.177
##    19     15       2   0.1098  0.0250       0.0608        0.159
##    20     13       2   0.0929  0.0238       0.0462        0.140
##    22      9       1   0.0826  0.0233       0.0369        0.128
##    24      4       1   0.0619  0.0250       0.0129        0.111
## 
##                 Manner=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    177      51    0.712  0.0340       0.6451        0.779
##     1    122      20    0.595  0.0371       0.5224        0.668
##     2     97      10    0.534  0.0380       0.4592        0.608
##     3     85      16    0.433  0.0383       0.3583        0.508
##     4     67      10    0.369  0.0376       0.2949        0.442
##     5     57       4    0.343  0.0372       0.2700        0.416
##     6     49       3    0.322  0.0368       0.2497        0.394
##     7     43       2    0.307  0.0366       0.2351        0.379
##     8     40       2    0.291  0.0363       0.2203        0.363
##     9     34       2    0.274  0.0362       0.2035        0.345
##    10     30       4    0.238  0.0357       0.1679        0.308
##    11     23       4    0.196  0.0349       0.1279        0.265
##    12     19       2    0.176  0.0342       0.1087        0.243
##    13     15       2    0.152  0.0334       0.0868        0.218
##    14     12       1    0.140  0.0329       0.0750        0.204
##    16      8       1    0.122  0.0331       0.0572        0.187

From the summary, we see that the estimate of median survival time for \(Manner=0\) is 3, whereas for \(Manner=1\) is is 2. This would lead us to suggest that there is a difference between the distributions. This is not contradictory to our results in (c) because finding a confidence interval for median survival time compares one survival aspect, not the entire survival curve.

S = function(x) 1-pnorm(x, )
  1. To perform the Mantel-Cox log-rank test of \(Years\) by \(Region\):
survdiff(Surv(Years)~Region, data=leaders)
## Call:
## survdiff(formula = Surv(Years) ~ Region, data = leaders)
## 
##            N Observed Expected (O-E)^2/E (O-E)^2/V
## Region=0  65       65     74.2     1.151     1.659
## Region=1 134      134    169.7     7.495    14.620
## Region=2  99       99    102.7     0.136     0.211
## Region=3 174      174    125.4    18.879    33.274
## 
##  Chisq= 35.9  on 3 degrees of freedom, p= 7.68e-08
1.151+7.495+.136+18.879 #calculate test statistic
## [1] 27.66
1-pchisq(27.661, 3) #calculate p-value
## [1] 4.278e-06

The p-value is essentially 0, leading us to believe that there is a relationship between region and number of years in power.

  1. Region 3 (Latin America) is largely influencing the decision we made in (g) because it contributes the most to the chi-square test statistic (18.879 out of 27.661).

LitCat = 1 + (leaders$Literacy > 20) + (leaders$Literacy > 40) + (leaders$Literacy > 60) + (leaders$Literacy > 80)
survdiff(Surv(Years, Status)~LitCat, data=leaders)
## Call:
## survdiff(formula = Surv(Years, Status) ~ LitCat, data = leaders)
## 
##            N Observed Expected (O-E)^2/E (O-E)^2/V
## LitCat=1  71       49     63.2     3.191     4.543
## LitCat=2  69       50     54.1     0.316     0.435
## LitCat=3  67       44     59.1     3.864     5.412
## LitCat=4 129      105     86.2     4.113     6.374
## LitCat=5 136      109     94.4     2.266     3.636
## 
##  Chisq= 16.3  on 4 degrees of freedom, p= 0.00267
3.191+.316+3.864+4.113+2.266 #calculate test statistic
## [1] 13.75
1-pchisq(13.75, 4) #calculate p-value
## [1] 0.008137

The p-value is less than .05, indicating that \(LitCat\) is significantly related to the duration of leaders. We also see from the log-rank test that as literacy rate increases, so does the expected length of term for leaders.

  1. To perform the stratified log-rank test for \(Manner\):
survdiff(Surv(Years, Status)~LitCat + strata(Manner), data=leaders)
## Call:
## survdiff(formula = Surv(Years, Status) ~ LitCat + strata(Manner), 
##     data = leaders)
## 
##            N Observed Expected (O-E)^2/E (O-E)^2/V
## LitCat=1  71       49     64.7      3.79     5.662
## LitCat=2  69       50     54.8      0.42     0.591
## LitCat=3  67       44     59.4      4.00     5.733
## LitCat=4 129      105     85.9      4.27     6.792
## LitCat=5 136      109     92.3      3.04     4.965
## 
##  Chisq= 19.6  on 4 degrees of freedom, p= 0.000596
3.333+.483+1.953+2.436+3.109 #calculate test statistic
## [1] 11.31
1-pchisq(11.314, 4) #calculate p-value
## [1] 0.02325

\(LitCat\) is significant in the presence of \(Manner\).

To perform the stratified log-rank test for \(Region\):

survdiff(Surv(Years, Status)~LitCat + strata(Region), data=leaders)
## Call:
## survdiff(formula = Surv(Years, Status) ~ LitCat + strata(Region), 
##     data = leaders)
## 
##            N Observed Expected (O-E)^2/E (O-E)^2/V
## LitCat=1  71       49     49.1  5.79e-05  9.67e-05
## LitCat=2  69       50     45.8  3.91e-01  5.63e-01
## LitCat=3  67       44     53.1  1.54e+00  2.28e+00
## LitCat=4 129      105     90.2  2.43e+00  4.12e+00
## LitCat=5 136      109    118.9  8.28e-01  2.38e+00
## 
##  Chisq= 6.6  on 4 degrees of freedom, p= 0.159
.0154+.1604+.4241+1.7155+1.0978 #calculate test statistic
## [1] 3.413
1-pchisq(3.4132, 4) #calculate p-value
## [1] 0.4912

\(LitCat\) is not significant in the presence of \(Region\).

To make the table:

table(leaders$Region , LitCat )
##    LitCat
##      1  2  3  4  5
##   0 10 10 22 23  0
##   1 54 37 21 22  0
##   2  7 20 10 19 43
##   3  0  2 14 65 93

Region is a counfounder for both \(Years\) and \(LitCat\) because it influences both the number of years in power and the literacy rate. Poorly educated regions tend to have both a lower literacy rate and more leaders with long terms.

  1. creating the \(LitCat\) variable transformed the literacy data from numerical to categorical, which is required when evaluating significance using a log-rank test.

  2. To fit the Weibull model:

m = survreg(Surv(Years+0.001 , Status)~Literacy , dist = 'weibull' , data=leaders)
summary(m)
## 
## Call:
## survreg(formula = Surv(Years + 0.001, Status) ~ Literacy, data = leaders, 
##     dist = "weibull")
##               Value Std. Error     z        p
## (Intercept)  2.5213    0.33849  7.45 9.43e-14
## Literacy    -0.0138    0.00498 -2.77 5.62e-03
## Log(scale)   0.9580    0.04776 20.06 1.66e-89
## 
## Scale= 2.61 
## 
## Weibull distribution
## Loglik(model)= -712.8   Loglik(intercept only)= -716.7
##  Chisq= 7.86 on 1 degrees of freedom, p= 0.0051 
## Number of Newton-Raphson Iterations: 6 
## n= 472

The \(Literacy\) coefficient is negative, indicating that as literacy increases, the number of years a leader is in power decreases. This agrees with what we found in (i). The p-value for \(Literacy\) is 5.62e-03 (i.e. less than .05), indicating that there is a significant relationship between literacy and the number of years a leader is in power.

  1. To calculate leadership means:
S70 = function(x) 1-pweibull(x,1/2.61, exp(2.5213-0.0138*70))
integrate(S70,0,Inf)
## 17.81 with absolute error < 0.0017
S80 = function(x) 1-pweibull(x,1/2.61, exp(2.5213-0.0138*80))
integrate(S80,0,Inf)
## 15.51 with absolute error < 0.0019

The ratio of these two predictions is 1.148, which is very close to 1.143, or 80/70.

m = survreg( Surv(Years+0.001 , Status)~Literacy + factor(Region) ,dist = 'weibull' , data=leaders )
summary(m)
## 
## Call:
## survreg(formula = Surv(Years + 0.001, Status) ~ Literacy + factor(Region), 
##     data = leaders, dist = "weibull")
##                     Value Std. Error       z        p
## (Intercept)      2.104983    0.49160  4.2819 1.85e-05
## Literacy        -0.000405    0.00646 -0.0627 9.50e-01
## factor(Region)1  0.257156    0.46962  0.5476 5.84e-01
## factor(Region)2 -0.243890    0.49324 -0.4945 6.21e-01
## factor(Region)3 -1.150246    0.47706 -2.4111 1.59e-02
## Log(scale)       0.944090    0.04787 19.7228 1.37e-86
## 
## Scale= 2.57 
## 
## Weibull distribution
## Loglik(model)= -706.9   Loglik(intercept only)= -716.7
##  Chisq= 19.61 on 4 degrees of freedom, p= 6e-04 
## Number of Newton-Raphson Iterations: 6 
## n= 472

When controlling for region, literacy is no longer significant (p-value=.95).

Question 3

  1. 9 2x2 tables are required.

(0,2] Dead Alive Total
Group 1 1 4 5
Group 2 0 5 5
Total 1 9 10
(2, 7] Dead Alive Total
Group 1 1 3 4
Group 2 1 4 5
Total 3 6 9
  1. Expected number of failures for each group from the second table:

Group 1 Dead: 4/9

Group 2 Dead: 5/9

Group 1 Alive: 5/9

Group 2 Alive: 4/9

Question 4

  1. False

  2. True

  3. True

  4. True

  5. False

  6. True

  7. True

  8. True

  9. True