Question 1

Consider the mortality table based on 608 known deaths of Dall Mountain Sheep. The data are expressed per 1,000 sheep on the table below

Compute and plot the Survival Function

Surv.1 <- Surv(time = df.1.t$Start_Interval, 
     time2 = df.1.t$End_Interval, 
     event = rep(1, nrow(df.1.t)),
     type = "interval")

summary(fit.1 <- survfit(Surv.1~1))
## Call: survfit(formula = Surv.1 ~ 1)
## 
##  time  n.risk n.event survival  std.err lower 95% CI upper 95% CI
##     0 1001.00   199.2    0.801 1.01e-02      0.77665      0.82611
##     1  801.80    12.0    0.789 1.02e-02      0.76413      0.81467
##     2  789.79    13.0    0.776 1.02e-02      0.75061      0.80225
##     3  776.78    12.0    0.764 1.02e-02      0.73815      0.79075
##     4  764.76    30.0    0.734 1.02e-02      0.70714      0.76188
##     5  734.73    46.0    0.688 1.01e-02      0.65990      0.71730
##     6  688.69    48.0    0.640 9.71e-03      0.61096      0.67042
##     7  640.64    69.1    0.571 8.93e-03      0.54116      0.60248
##     8  571.57   132.1    0.439 6.88e-03      0.40932      0.47083
##     9  439.44   187.2    0.252 3.46e-03      0.22650      0.28037
##    10  252.25   156.2    0.096 8.93e-04      0.07939      0.11609
##    11   96.10    90.1    0.006 1.46e-05      0.00270      0.01331
##    12    6.01     3.0    0.003 5.18e-06      0.00097      0.00928
##    13    3.00     3.0    0.000 0.00e+00           NA           NA

Compute and plot the Probability Function

df.1 <- df.1 %>% 
  mutate("f(t)" = Number_of_Deaths/(max(Number_Of_Survivors) * (End_Interval - Start_Interval)))

Computer and plot the Hazard Function

df.1 <- df.1 %>% 
  mutate(Deaths_Per_Time = Number_of_Deaths/(End_Interval - Start_Interval),
         "h(t)" = Deaths_Per_Time/(Number_Of_Survivors - Number_of_Deaths/2))

Discussing the Hazard Function

Looking at the hazard function we see that the first value is high possible indicating infant mortality. After the sheep lives past this point the hazard function increases over time typical behavior for life span which according to the National Park Service is 12 to 16 years. No sheep in our sample lived to 16 years I suspect this is due to how the sample was collected. Is is unlikely that humans were recording when the sheep were born and more likely they caught young sheep and kept up with them in the wild. It might be possible that the sheep dying at 13 years might be 16 years old. This hypothesis would explain the low initial hazard function because death at births are not captured.

Mean and Median Survival Times

print(fit.1, print.rmean = TRUE)
## Call: survfit(formula = Surv.1 ~ 1)
## 
##      records    n events rmean* se(rmean) median 0.95LCL 0.95UCL
## [1,]    1000 1001   1001   6.56     0.122      8       8       8
##     * restricted mean with upper limit =  13

We see that the mean survival time is 6.559 years and the median 8 years

Question 2

Consider the remission data of 40 female patients diagnosed with breast cancer in a certain hospital during the years 1989–93. The data include: the age of the patient as of last contact (follow up/death), stage of cancer at time of diagnosis, date of last contact, status (alive/dead), the number of days lived, the number of years lived. The data are given on the following page.

Compute and plot the PL estimates of S(t) at every time to death for the Stage 1 and 2 groups.

Surv.2 <- Surv(df.2$Years, df.2$Status.l)
fit.2 <- survfit(Surv.2 ~ Stage, data = df.2 , robust = TRUE)
summary(fit.2)
## Call: survfit(formula = Surv.2 ~ Stage, data = df.2, robust = TRUE)
## 
##                 Stage=1 
##   time n.risk n.event survival std.err lower 95% CI upper 95% CI
##  0.991     18       1    0.944  0.0540        0.844         1.00
##  1.021     17       1    0.889  0.0741        0.755         1.00
##  1.142     16       1    0.833  0.0878        0.678         1.00
##  2.264     11       1    0.758  0.1077        0.573         1.00
##  3.578      7       1    0.649  0.1362        0.430         0.98
##  4.416      4       1    0.487  0.1738        0.242         0.98
##  5.098      3       1    0.325  0.1760        0.112         0.94
## 
##                 Stage=2 
##   time n.risk n.event survival std.err lower 95% CI upper 95% CI
##  0.246     21       1    0.952  0.0465       0.8655        1.000
##  0.389     20       1    0.905  0.0641       0.7875        1.000
##  0.652     19       1    0.857  0.0764       0.7198        1.000
##  0.797     18       1    0.810  0.0857       0.6579        0.996
##  1.098     17       1    0.762  0.0929       0.5999        0.968
##  1.736     15       1    0.711  0.0997       0.5403        0.936
##  1.944     14       1    0.660  0.1047       0.4839        0.901
##  2.004     13       1    0.610  0.1083       0.4303        0.863
##  2.292     12       1    0.559  0.1105       0.3792        0.823
##  2.839     10       1    0.503  0.1127       0.3241        0.780
##  3.157      8       1    0.440  0.1148       0.2638        0.734
##  3.307      7       1    0.377  0.1143       0.2082        0.683
##  3.912      6       1    0.314  0.1112       0.1571        0.629
##  4.578      3       1    0.210  0.1132       0.0727        0.604

Comparing Survival Functions

survdiff(Surv.2 ~ Stage, data = df.2) 
## Call:
## survdiff(formula = Surv.2 ~ Stage, data = df.2)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## Stage=1 19        7     10.5      1.15      2.33
## Stage=2 21       14     10.5      1.14      2.33
## 
##  Chisq= 2.3  on 1 degrees of freedom, p= 0.1

Doing a log rank test we see the curves are significantly different at the .1 level but not at the .05 level. Looking at the curves we see that stage 2 had a smaller survival curve throughout time which makes practical sense since stage 2 cancer is more intense then stage 1.

Compute the survival function and the variance

To calculate Survival and variance at a particular point we will linearly interpolate the points closest to the point at interest that have had events.

For the variance of S(t) We can use Greenwood’s Formula

\[\widehat{Var}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_i \leq t}\frac{d_i}{r_i(r_i - d_i)} \] Where \(r_i\) is the number at risk right before \(t\) and \(d_i\) is the number of deaths occurred at time \(t\)

fit.df.2 <- broom::tidy(fit.2)
fit.df.2 <- fit.df.2 %>% 
  group_by(strata) %>% 
  mutate(variance = estimate^2 * cumsum(n.event/(n.risk * (n.risk - n.event)))) %>% #Estimating the variance 
  ungroup()

At S(2) for stage 1 patients

s.2<- sliced.2 %$% 
  approx(time, estimate, 2)$y #Does linear interpolation for s(2)
v.2 <- sliced.2 %$%
  approx(time, variance, 2)$y #Does linear interpolation for var(s(2))

Using the linear interpolation at \(S(2)\) we get a value of 0.7754 and a variance of 0.0107

At S(3) for stage 2 patients

s.3<- sliced.3 %$% 
  approx(time, estimate, 3)$y #Does linear interpolation for s(2)
v.3 <- sliced.3 %$%
  approx(time, variance, 3)$y #Does linear interpolation for var(s(2))

Using the linear interpolation at \(S(3)\) we get a value of 0.471 and a variance of 0.0129.

What is the Median Remission Times

print(fit.2)
## Call: survfit(formula = Surv.2 ~ Stage, data = df.2, robust = TRUE)
## 
##          n events median 0.95LCL 0.95UCL
## Stage=1 19      7   4.42    3.58      NA
## Stage=2 21     14   3.16    1.94      NA

We see that for that the median survival time of Stage 1 is 4.42 years and the median survival time of Stage 2 is 3.16

Question 3

Using the data in problem, Q2, is there a difference in the survival functions by stage of cancer assessed at the time of diagnosis?

In Question 2 we examined the differences between the 2 survival functions of cancer patients at different stages of cancer. Conceptually we would imagine that the stage 2 cancer patients would have a greater risk of death since stage 2 cancer is greater intensity then stage 1.

## Call: survfit(formula = Surv.2 ~ Stage, data = df.2, robust = TRUE)
## 
##          n events median 0.95LCL 0.95UCL
## Stage=1 19      7   4.42    3.58      NA
## Stage=2 21     14   3.16    1.94      NA

When simply looking at the descriptive statistics of we see that the stage 1 had 7 deaths while stage 2 had 14 deaths. With comparable sample sizes. The stage 1 median is also less then the median of stage 2 patients. We examine the survival curves plots we see that stage 2 patients have a lower survival curve through out the time domain. When doing the statistical test to see if the groups are different we have significance at the .1 level. In conclusion it is fair to say that stage 1 cancer patients survive longer then stage 2 patients.