R Markdown

library("survival")
library("lubridate")
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library("ggplot2")
library("survminer")
## Loading required package: ggpubr
## Loading required package: magrittr
library("Publish")
## Loading required package: prodlim
  1. Read in the data from the internet:
hmohiv<-read.table("https://stats.idre.ucla.edu/stat/r/examples/asa/hmohiv.csv", sep=",", header = TRUE) 
  1. There are the following variables in the data: age, time, drug, entry and end date.
  2. Convert entry and end dates to R dates with the following code: as.Date(hmohiv$entdate,format=“%m/%d/%Y”).
hmohiv$entdate <- as.Date(hmohiv$entdate,format="%m/%d/%Y")
hmohiv$enddate <- as.Date(hmohiv$enddate,format="%m/%d/%Y")
  1. Create a histogram of entry dates with ggplot2.
ggplot(hmohiv,aes(entdate)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Compute the follow up time in days and compare with the “time” variable. Hint: just subtract, but convert into numeric (use as.numeric). What is the relationship between the time difference we computed and the “time” variable?
hmohiv$obstime = as.numeric(hmohiv$enddate-hmohiv$entdate)
  1. What is the date range of follow-up, entry and end times?
range(hmohiv$obstime)
## [1]   30 1825
range(hmohiv$entdate)
## [1] "1989-01-13" "1991-12-28"
range(hmohiv$enddate)
## [1] "1989-02-16" "1995-11-14"
  1. Is survival time normally distributed?
ggplot(hmohiv,aes(obstime)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Use Surv(hmohiv\(time,hmohiv\)censor) to create a survival variable (survVar). Create a Kaplan-Meier curve with survfit(survVar~1, …..), and plot with ggsurvplot. Try “risk.table=TRUE” argument.
hmohiv$survVar <- Surv(hmohiv$time,hmohiv$censor)
KM <- survfit(survVar ~ 1,  type="kaplan-meier", conf.type="log", data=hmohiv)
ggsurvplot(survfit(survVar~drug,data=hmohiv),risk.table=TRUE, pval = TRUE)

  1. How many events in each group?
with(hmohiv,table(hmohiv$time,hmohiv$censor))
##     
##       0  1
##   1   2 15
##   2   5  5
##   3   2 10
##   4   1  4
##   5   0  7
##   6   1  2
##   7   1  6
##   8   0  4
##   9   0  3
##   10  1  3
##   11  0  3
##   12  2  2
##   13  0  1
##   14  0  1
##   15  0  2
##   19  1  0
##   22  0  1
##   24  1  0
##   30  0  1
##   31  0  1
##   32  0  1
##   34  0  1
##   35  0  1
##   36  0  1
##   43  0  1
##   53  0  1
##   54  0  1
##   56  1  0
##   57  0  1
##   58  0  1
##   60  2  0
  1. What were the median survival times with 95% Cis in each group? Hint: summary(fit)$table
median(summary(KM$surv))
## [1] 0.2649738
  1. Use survdiff() to compute the logrank test.
survdiff(survVar~censor, data = hmohiv)
## Call:
## survdiff(formula = survVar ~ censor, data = hmohiv)
## 
##           N Observed Expected (O-E)^2/E (O-E)^2/V
## censor=0 20        0     18.3     18.32      26.6
## censor=1 80       80     61.7      5.44      26.6
## 
##  Chisq= 26.6  on 1 degrees of freedom, p= 2e-07
  1. Was the difference statistically significant?
survdiff(survVar~censor, data = hmohiv)
## Call:
## survdiff(formula = survVar ~ censor, data = hmohiv)
## 
##           N Observed Expected (O-E)^2/E (O-E)^2/V
## censor=0 20        0     18.3     18.32      26.6
## censor=1 80       80     61.7      5.44      26.6
## 
##  Chisq= 26.6  on 1 degrees of freedom, p= 2e-07
  1. Use coxph function to estimate the hazard ratio (HR). Give the 95% CI and interpret the parameter. Remember to exponeniate!
Cox1 <- coxph(survVar ~ age, data = hmohiv)
Cox1
## Call:
## coxph(formula = survVar ~ age, data = hmohiv)
## 
##        coef exp(coef) se(coef)     z        p
## age 0.08529   1.08903  0.01745 4.888 1.02e-06
## 
## Likelihood ratio test=23.34  on 1 df, p=1.359e-06
## n= 100, number of events= 80
  1. Create age group variable for each decade from 20 to 60 using the “cut()” function.
cut(hmohiv$age, breaks = 20:60)
##   [1] (45,46] (34,35] (29,30] (29,30] (35,36] (31,32] (35,36] (30,31]
##   [9] (47,48] (46,47] (27,28] (33,34] (43,44] (31,32] (35,36] (35,36]
##  [17] (53,54] (34,35] (43,44] (37,38] (39,40] (33,34] (24,25] (31,32]
##  [25] (41,42] (46,47] (29,30] (46,47] (40,41] (39,40] (42,43] (40,41]
##  [33] (29,30] (36,37] (41,42] (30,31] (38,39] (31,32] (50,51] (35,36]
##  [41] (42,43] (38,39] (32,33] (44,45] (32,33] (27,28] (30,31] <NA>   
##  [49] (43,44] (38,39] (32,33] (30,31] (32,33] (49,50] (35,36] (29,30]
##  [57] (41,42] (31,32] (33,34] (37,38] (32,33] (38,39] (38,39] (32,33]
##  [65] (33,34] (33,34] (45,46] (21,22] (43,44] (36,37] (24,25] (37,38]
##  [73] (31,32] (33,34] (28,29] (35,36] (20,21] (25,26] (31,32] (41,42]
##  [81] (39,40] (36,37] (46,47] (31,32] (40,41] (45,46] (25,26] (29,30]
##  [89] (31,32] (30,31] (34,35] (35,36] (40,41] (35,36] (34,35] (33,34]
##  [97] (27,28] (28,29] (34,35] (33,34]
## 40 Levels: (20,21] (21,22] (22,23] (23,24] (24,25] (25,26] ... (59,60]
  1. Plot the relationship between decade and survival. Hint: use survfit.
ggsurvplot(survfit(survVar~censor,data=hmohiv),risk.table=TRUE, pval = TRUE)

  1. Was age a factor in survival? What is the HR?
ggsurvplot(survfit(survVar~age,data=hmohiv), pval = TRUE)

  1. Use coxph to simultaneously estimate HR due to age & drug and report CI and p-values.
Cox2 <- coxph(survVar ~ age + drug, data = hmohiv)
Cox2
## Call:
## coxph(formula = survVar ~ age + drug, data = hmohiv)
## 
##         coef exp(coef) se(coef)     z        p
## age  0.09714   1.10202  0.01864 5.211 1.87e-07
## drug 1.01670   2.76405  0.25622 3.968 7.24e-05
## 
## Likelihood ratio test=39.13  on 2 df, p=3.182e-09
## n= 100, number of events= 80