October 12, 2017

Outline

Get familiar with R (in RStudio)

Install R and RStudio

  • Install R:
    • go to the home page https://www.r-project.org/;
    • click download CRAN in the left bar;
    • choose a download site;
    • choose your operating system;
    • select base for Windows, or the last .pkg for Mac, and download.
  • Install RStudio:
    • go to the home page https://www.rstudio.com/;
    • click Download RStudio;
    • click Download RStudio Desktop;
    • click Recommended For Your System;
    • download the .exe or .dmg file and run it.

RStudio interface

  • Bottom left: console (or command) window.
    Here you can type simple commands after the ">" prompt and R will then execute your command.

  • Top left: editor (or script) window.
    Collections of commands (scripts) can be edited and saved. If you want to run a line from the script window you can click Run or press CTRL+ENTER or CMD+ENTER to send it to the command window.

  • Top right: environment/history window.
    In the environment window you can see which data and values R has in its memory. The history window shows what has been typed before.

  • Bottom right: files/plots/packages/help window.
    Here you can open files, view plots, install and load packages or use the help function.

Packages and libraries

R can do many statistical and data analyses. They are organized in so-called packages or libraries. Most common packages are already installed.

Many more packages are available on CRAN.

For example, to install and use the tidyverse package type in the command line or execute them from the editor/script

# install the package
install.packages("tidyverse")
# load the package
library(tidyverse)

R packages for data science (manipulate and visualize data)

R for Data Science

tidyverse R package

The tidyverse is a set of packages that work in harmony because they share common data representations and API design.

It includes:

  • ggplot2, for data visualization
  • dplyr, for data manipulation
  • tidyr, for data tidying
  • readr, for data import
  • purrr, for functional programming
  • tibble, for tibbles, a modern re-imagining of data frames
  • and many more …

Motivating example

Data-set (oralca.txt)

orca = read.table("http://www.stats4life.se/data/oralca.txt")
orca = as_data_frame(orca)

Reference (article)
Läärä E, Korpi JT, Pitkänen H, Alho OP, Kantola S. Competing risks analysis of cause‐specific mortality in patients with oral squamous cell carcinoma. Head & neck. 2017 Jan 1;39(1):56-62.

Descriptive abstract
A population-based retrospective cohort design, including patients diagnosed with an OSCC cancer between January 1, 1985 and December 31, 2005 from the 2 northernmost provinces of Finland. The classification of deaths into the 2 categories in this study: (1) deaths from OSCC; and (2) deaths of other causes, was based on the Finnish Cancer Registry.

Read data from different format

# reads a file in table format (.csv) into data.frame
orca = read.csv("http://www.stats4life.se/data/oralca.csv")
# read txt into a data.frame
orca = read.table("http://www.stats4life.se/data/oralca.txt")
# reads a delimited file (.csv and .tsc) into tibble
library(readr)
orca = read_csv("http://www.stats4life.se/data/oralca.csv")
library(haven)
# read a Stata data set into tibble
orca = read_dta("http://www.stats4life.se/data/oralca.dta")
# read a SPSS data set into tibble
orca = read_sav("http://www.stats4life.se/data/oralca.sav")
# read a SAS data set into tibble
orca = read_sas("http://www.stats4life.se/data/oralca.sas7bdat")
library(readxl)
# download and read an Excel sheet into tibble
download.file("http://www.stats4life.se/data/oralca.xlsx", destfile = "oralca.xlsx")
orca = read_excel("oralca.xlsx")
# load data in R format
load(url("http://www.stats4life.se/data/orcalca.rda"))

Exploring oralca data

# print data in the console
orca
# A tibble: 338 x 6
      id    sex      age  stage   time          event
 * <int> <fctr>    <dbl> <fctr>  <dbl>         <fctr>
 1     1   Male 65.42274   unkn  5.081          Alive
 2     2 Female 83.08783    III  0.419 Oral ca. death
 3     3   Male 52.59008     II  7.915    Other death
 4     4   Male 77.08630      I  2.480    Other death
 5     5   Male 80.33622     IV  2.500 Oral ca. death
 6     6 Female 82.58132     IV  0.167    Other death
 7     7   Male 73.07524     II  5.925          Alive
 8     8   Male 74.49622   unkn  1.503 Oral ca. death
 9     9 Female 56.49984     IV 13.333 Oral ca. death
10    10   Male 49.49075     II  7.666          Alive
# ... with 328 more rows
# dim = number of row anxd number of columns
dim(orca)
[1] 338   6
c("Number of obs:" = nrow(orca), "Number of variables:" = ncol(orca))
      Number of obs: Number of variables: 
                 338                    6 

# More in general, get the structure of the data
str(orca)
Classes 'tbl_df', 'tbl' and 'data.frame':   338 obs. of  6 variables:
 $ id   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ sex  : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 1 2 2 1 2 ...
 $ age  : num  65.4 83.1 52.6 77.1 80.3 ...
 $ stage: Factor w/ 5 levels "I","II","III",..: 5 3 2 1 4 4 2 5 4 2 ...
 $ time : num  5.081 0.419 7.915 2.48 2.5 ...
 $ event: Factor w/ 3 levels "Alive","Oral ca. death",..: 1 2 3 3 2 3 1 2 2 1 ...
# names of the variable
names(orca)
[1] "id"    "sex"   "age"   "stage" "time"  "event"
# a quick summary of the data
summary(orca)
       id             sex           age         stage         time                   event    
 Min.   :  1.00   Female:152   Min.   :15.15   I   :50   Min.   : 0.085   Alive         :109  
 1st Qu.: 85.25   Male  :186   1st Qu.:53.24   II  :77   1st Qu.: 1.333   Oral ca. death:122  
 Median :169.50                Median :64.86   III :72   Median : 3.869   Other death   :107  
 Mean   :169.50                Mean   :63.51   IV  :68   Mean   : 5.662                       
 3rd Qu.:253.75                3rd Qu.:74.29   unkn:71   3rd Qu.: 8.417                       
 Max.   :338.00                Max.   :92.24             Max.   :23.258                       

dplyr: grammar of data manipulation

https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html


5 functions (verbs) to manipulate data

  • Create new variables with functions of existing variables (mutate())
  • Pick variables by their names (select())
  • Pick observations by their values (filter())
  • Reorder the rows (arrange())
  • Collapse many values down to a single summary (summarise())

Add new columns with mutate()

# create an indicator variable for death in a new data-set called 'orca2'
orca2 = mutate(orca, dead = (event != "Alive"))
orca2
# A tibble: 338 x 7
      id    sex      age  stage   time          event  dead
   <int> <fctr>    <dbl> <fctr>  <dbl>         <fctr> <lgl>
 1     1   Male 65.42274   unkn  5.081          Alive FALSE
 2     2 Female 83.08783    III  0.419 Oral ca. death  TRUE
 3     3   Male 52.59008     II  7.915    Other death  TRUE
 4     4   Male 77.08630      I  2.480    Other death  TRUE
 5     5   Male 80.33622     IV  2.500 Oral ca. death  TRUE
 6     6 Female 82.58132     IV  0.167    Other death  TRUE
 7     7   Male 73.07524     II  5.925          Alive FALSE
 8     8   Male 74.49622   unkn  1.503 Oral ca. death  TRUE
 9     9 Female 56.49984     IV 13.333 Oral ca. death  TRUE
10    10   Male 49.49075     II  7.666          Alive FALSE
# ... with 328 more rows

Select columns with select()

# select only variables: age, sex, stage, time, and dead
select(orca2, age, sex, stage, time, dead)
# A tibble: 338 x 5
        age    sex  stage   time  dead
      <dbl> <fctr> <fctr>  <dbl> <lgl>
 1 65.42274   Male   unkn  5.081 FALSE
 2 83.08783 Female    III  0.419  TRUE
 3 52.59008   Male     II  7.915  TRUE
 4 77.08630   Male      I  2.480  TRUE
 5 80.33622   Male     IV  2.500  TRUE
 6 82.58132 Female     IV  0.167  TRUE
 7 73.07524   Male     II  5.925 FALSE
 8 74.49622   Male   unkn  1.503  TRUE
 9 56.49984 Female     IV 13.333  TRUE
10 49.49075   Male     II  7.666 FALSE
# ... with 328 more rows

Filter rows with filter()

# select only male patients with diagnosed stage IV
filter(orca2, sex == "Male", stage == "IV")
# A tibble: 44 x 7
      id    sex      age  stage  time          event  dead
   <int> <fctr>    <dbl> <fctr> <dbl>         <fctr> <lgl>
 1     5   Male 80.33622     IV 2.500 Oral ca. death  TRUE
 2    11   Male 53.08017     IV 2.834          Alive FALSE
 3    14   Male 74.08006     IV 4.668    Other death  TRUE
 4    27   Male 65.98675     IV 5.246    Other death  TRUE
 5    28   Male 57.31574     IV 1.815          Alive FALSE
 6    29   Male 74.40313     IV 0.747    Other death  TRUE
 7    37   Male 63.31179     IV 0.252 Oral ca. death  TRUE
 8    52   Male 54.06034     IV 5.248    Other death  TRUE
 9    57   Male 51.05684     IV 0.914    Other death  TRUE
10    62   Male 59.47049     IV 1.084 Oral ca. death  TRUE
# ... with 34 more rows

Arrange rows with arrange()

# order the data-set with increasing time, and descreasing age
arrange(orca2, time, desc(age))
# A tibble: 338 x 7
      id    sex      age  stage  time          event  dead
   <int> <fctr>    <dbl> <fctr> <dbl>         <fctr> <lgl>
 1   265   Male 62.62731     IV 0.085 Oral ca. death  TRUE
 2   227   Male 47.80966     IV 0.085 Oral ca. death  TRUE
 3    18 Female 90.74855     IV 0.162 Oral ca. death  TRUE
 4   229   Male 53.14588   unkn 0.162 Oral ca. death  TRUE
 5    79   Male 87.71219   unkn 0.167 Oral ca. death  TRUE
 6     6 Female 82.58132     IV 0.167    Other death  TRUE
 7   183   Male 77.32176    III 0.167    Other death  TRUE
 8   199   Male 72.40171   unkn 0.167 Oral ca. death  TRUE
 9   249 Female 81.71887      I 0.170    Other death  TRUE
10   190   Male 62.81897   unkn 0.170 Oral ca. death  TRUE
# ... with 328 more rows

The %>% (pipe) operator

Combine your code and make it readable

orca %>% 
  mutate(dead = (event != "Alive")) %>%
  select(age, sex, stage, time, dead) %>%
  filter(sex == "Male", stage == "IV") %>%
  arrange(time, desc(age)) %>% head()
# A tibble: 6 x 5
       age    sex  stage  time  dead
     <dbl> <fctr> <fctr> <dbl> <lgl>
1 62.62731   Male     IV 0.085  TRUE
2 47.80966   Male     IV 0.085  TRUE
3 65.27489   Male     IV 0.252  TRUE
4 63.31179   Male     IV 0.252  TRUE
5 68.43719   Male     IV 0.329  TRUE
6 70.81919   Male     IV 0.419  TRUE

ggplot2: grammar of graphics

http://www.cookbook-r.com/Graphs/

A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
  1. data: the data-set comprised of variables that we map;
  2. geom: the geometric object in question, i.e. the objects we can observe in our plot (e.g. points, lines, bars, etc);
  3. aes: aesthetic attributes of the geometric object that we can perceive on a graphic (e.g. x/y position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data-set).

# an example
ggplot(orca, aes(x = id, y = time)) +
  geom_linerange(aes(ymin = 0, ymax = time)) +
  geom_point(aes(color = event, shape = event)) + coord_flip()

# another example
ggplot(data = orca, aes(x = age, y = time)) +
  geom_point(aes(shape = event, color = stage)) + 
  geom_smooth(se = FALSE)

ggplot behind the scene

geom_functions univariate

Obtain summary statistics and produce useful graphs

Quantitative variable (age at diagnosis)

# first 6 observations
head(orca$age)
[1] 65.42274 83.08783 52.59008 77.08630 80.33622 82.58132
# summary statistics (range, quartiles, and mean)
summary(orca$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.15   53.24   64.86   63.51   74.29   92.24 
# save sample size, mean, median, and standard deviation in a new object
stat_age = c(n = length(orca$age), mean = mean(orca$age), 
               median = median(orca$age), sd = sd(orca$age))
stat_age
        n      mean    median        sd 
338.00000  63.51398  64.85735  14.43288 

Most common graphical presentation: histogram (?geom_histogram) and …

ggplot(orca, aes(age)) +
  geom_histogram()

ggplot(orca, aes(x = time, y = ..density..)) +
  geom_histogram(bins = 25) +
  geom_line(stat = "density")

… box plot (?geom_boxplot)

ggplot(orca, aes(x = 1, y = age)) + geom_boxplot() +
  labs(x = "", y = "Age (yr)") + theme_classic()

Categorical variable (event)

Table of frequencies are the typical way to summarize a categorical variable

# summary(orca$event)
tab = table(orca$event)
tab

         Alive Oral ca. death    Other death 
           109            122            107 
# for proportions
round(prop.table(tab), 2)

         Alive Oral ca. death    Other death 
          0.32           0.36           0.32 

Most common graphical presentation: bar plot (?geom_bar)

ggplot(orca, aes(event)) + geom_bar()


library(scales)
ggplot(orca, aes(event)) + 
  geom_bar(aes(y = ..count../sum(..count..))) +
  labs(x = "", y = "Frequency") +
  scale_y_continuous(labels = percent)

Bivariate association (age and stage)

table(orca$stage)

   I   II  III   IV unkn 
  50   77   72   68   71 
orca %>% 
  group_by(stage) %>%
  summarize(mean = mean(age), se = sd(age))
# A tibble: 5 x 3
   stage     mean       se
  <fctr>    <dbl>    <dbl>
1      I 62.88139 14.87824
2     II 64.63002 14.63587
3    III 61.69197 13.64211
4     IV 64.20545 13.77372
5   unkn 63.93454 15.46824

Graphical presentation: box plot

ggplot(orca, aes(x = stage, y = age)) + 
  geom_boxplot() + coord_flip()

mod = lm(age ~ stage, data = orca)
summary(mod)

Call:
lm(formula = age ~ stage, data = orca)

Residuals:
   Min     1Q Median     3Q    Max 
-47.73 -10.22   1.13  10.45  28.30 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   62.881      2.047  30.712   <2e-16 ***
stageII        1.749      2.630   0.665    0.507    
stageIII      -1.189      2.665  -0.446    0.656    
stageIV        1.324      2.697   0.491    0.624    
stageunkn      1.053      2.673   0.394    0.694    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.48 on 333 degrees of freedom
Multiple R-squared:  0.005698,  Adjusted R-squared:  -0.006245 
F-statistic: 0.4771 on 4 and 333 DF,  p-value: 0.7526

Graphically compare the distributions

ggplot(orca, aes(x = age, col = stage)) +
  geom_line(stat = "density")


ggplot(orca, aes(x = age, group = stage)) +
  geom_line(stat = "density") +
  facet_grid(stage ~ .)

Bivariate association (sex and dead)

# create binary variable 'dead' (0 alive 1 dead)
orca = mutate(orca, dead = factor(event != "Alive", labels = c("Alive", "Dead")))
tab2 = with(orca, table(sex, dead))
addmargins(tab2)
        dead
sex      Alive Dead Sum
  Female    50  102 152
  Male      59  127 186
  Sum      109  229 338
# get proportion (margin: 1 by row, 2 by col)
100*prop.table(tab2, margin = 2)
        dead
sex         Alive     Dead
  Female 45.87156 44.54148
  Male   54.12844 55.45852
# Test the association
chisq.test(tab2)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab2
X-squared = 0.012725, df = 1, p-value = 0.9102

Check out the Epi package

library(Epi)
stat.table(list(sex, dead), 
           list(count(), percent(sex)), 
           orca, margin = T)
 --------------------------------- 
         ----------dead----------- 
 sex        Alive    Dead   Total  
 --------------------------------- 
 Female        50     102     152  
             45.9    44.5    45.0  
                                   
 Male          59     127     186  
             54.1    55.5    55.0  
                                   
                                   
 Total        109     229     338  
            100.0   100.0   100.0  
 --------------------------------- 

with(orca, twoby2(sex, relevel(dead, 2)))
2 by 2 table analysis: 
------------------------------------------------------ 
Outcome   : Dead 
Comparing : Female vs. Male 

       Dead Alive    P(Dead) 95% conf. interval
Female  102    50     0.6711    0.5926   0.7410
Male    127    59     0.6828    0.6125   0.7456

                                    95% conf. interval
             Relative Risk:  0.9828    0.8474   1.1399
         Sample Odds Ratio:  0.9477    0.5994   1.4984
Conditional MLE Odds Ratio:  0.9479    0.5845   1.5396
    Probability difference: -0.0117   -0.1118   0.0870

             Exact P-value: 0.9069 
        Asymptotic P-value: 0.8183 
------------------------------------------------------

Or alternatively with the epitools package

library(epitools)
with(orca, epitab(table(dead, sex), verbose = T))

Graphical presentation: bar plot

ggplot(orca, aes(x = dead, group = sex)) + 
  geom_bar(aes(y = ..prop.., fill = sex), 
           position = "dodge") + 
  scale_y_continuous(labels = percent)

ggplot(orca, aes(x = dead, group = sex)) + 
  geom_bar(aes(y = ..prop..), position = "dodge") +
  facet_grid(~ sex) +
  scale_y_continuous(labels = percent)

If time to event was not of interest (not this case)

mod_bin = glm(dead ~ relevel(sex, 2), data = orca, family = "binomial")
# presenting odds ratio
ci.exp(mod_bin)
                      exp(Est.)     2.5%    97.5%
(Intercept)           2.1525424 1.580672 2.931309
relevel(sex, 2)Female 0.9477165 0.599421 1.498390
# predicted probabilities
orca %>%
  mutate(prob = predict(mod_bin, type = "response")) %>%
  head()
# A tibble: 6 x 8
     id    sex      age  stage  time          event   dead      prob
  <int> <fctr>    <dbl> <fctr> <dbl>         <fctr> <fctr>     <dbl>
1     1   Male 65.42274   unkn 5.081          Alive  Alive 0.6827957
2     2 Female 83.08783    III 0.419 Oral ca. death   Dead 0.6710526
3     3   Male 52.59008     II 7.915    Other death   Dead 0.6827957
4     4   Male 77.08630      I 2.480    Other death   Dead 0.6827957
5     5   Male 80.33622     IV 2.500 Oral ca. death   Dead 0.6827957
6     6 Female 82.58132     IV 0.167    Other death   Dead 0.6710526

Multivariate association (sex, age, and dead)

orca %>%
  group_by(sex, dead) %>%
  summarise(mean = mean(age), sd = sd(age))
# A tibble: 4 x 4
# Groups:   sex [?]
     sex   dead     mean       sd
  <fctr> <fctr>    <dbl>    <dbl>
1 Female  Alive 61.12249 14.23391
2 Female   Dead 71.21227 12.74809
3   Male  Alive 51.94397 14.06550
4   Male   Dead 63.64768 12.00282
stat.table(index = list(sex, dead),
           contents = list(mean(age), sd(age)),
           data = orca, margins = T)
 --------------------------------- 
         ----------dead----------- 
 sex        Alive    Dead   Total  
 --------------------------------- 
 Female     61.12   71.21   67.89  
            14.23   12.75   14.04  
                                   
 Male       51.94   63.65   59.94  
            14.07   12.00   13.78  
                                   
                                   
 Total      56.15   67.02   63.51  
            14.81   12.88   14.43  
 --------------------------------- 

ggplot(orca, aes(x = age, col = dead)) + 
  geom_line(stat = "density") +
  facet_grid(~ sex)

Example of survival analysis (more material can be found here)

Representing survival data

library(plotly)
ggplotly(
  ggplot(orca, aes(x = id, y = time)) + geom_linerange(aes(ymin = 0, ymax = time)) +
    geom_point(aes(shape = event, color = event)) + scale_shape_manual(values = c(1, 3, 4)) +
    labs(y = "Time (years)", x = "Subject ID") + coord_flip() + theme_classic()
)

Creating a survival object

The Surv() function creates a survival object, i.e. the time variable and the failure or status indicator

library(survival)
su_obj = Surv(orca$time, orca$dead == "Dead")
head(su_obj, n = 20)
 [1]  5.081+  0.419   7.915   2.480   2.500   0.167   5.925+  1.503  13.333   7.666+  2.834+  0.914   3.083   4.668 
[15]  4.816   2.166   5.837   0.162   3.666   0.504 

The created survival object is then used as response variable in specific functions for survival analysis.

Estimating the Survival Function (Kaplan–Meier)

The survfit() function by the default estimate then survival curve using the Kaplan–Meier estimator

fit_km = survfit(su_obj ~ 1, data = orca)
print(fit_km, print.rmean = TRUE)
Call: survfit(formula = su_obj ~ 1, data = orca)

         n     events     *rmean *se(rmean)     median    0.95LCL    0.95UCL 
   338.000    229.000      8.060      0.465      5.418      4.331      6.916 
    * restricted mean with upper limit =  23.3 

The fortify() function of the ggfortify package is useful to extract the whole survival table in a data.frame.

library(ggfortify)
fortify(fit_km) %>% head()
   time n.risk n.event n.censor      surv     std.err     upper     lower
1 0.085    338       2        0 0.9940828 0.004196498 1.0000000 0.9859401
2 0.162    336       2        0 0.9881657 0.005952486 0.9997618 0.9767041
3 0.167    334       4        0 0.9763314 0.008468952 0.9926726 0.9602592
4 0.170    330       2        0 0.9704142 0.009497400 0.9886472 0.9525175
5 0.246    328       1        0 0.9674556 0.009976176 0.9865584 0.9487228
6 0.249    327       1        0 0.9644970 0.010435745 0.9844277 0.9449699

Change the argument fun = events for the cumulative proportion, cumhaz for the cumulative hazard, cloglog for the complementary log−log

library(survminer)
ggsurvplot(fit_km, risk.table = TRUE, fun = "pct", censor = T, 
           xlab = "Time (years)", palette = "black", legend = "none")

Measures of central tendency

mc = data.frame(q = c(.25, .5, .75), km = quantile(fit_km))
mc
      q km.quantile km.lower km.upper
25 0.25       1.333    1.084    1.834
50 0.50       5.418    4.331    6.916
75 0.75      13.673   11.748   16.580

ggsurvplot(fit_km, xlab = "Time (years)", censor = F, legend = "none")$plot +
  geom_segment(data = mc, aes(x = km.quantile, y = 1-q, xend = km.quantile, 
                              yend = 0), lty = 2) +
  geom_segment(data = mc, aes(x = 0, y = 1-q, xend = km.quantile, yend = 1-q), lty = 2)

Estimating the Survival Function (parametric)

Check ?flexsurvreg for a description of the different distributions

library(flexsurv)
fit_wei = flexsurvreg(su_obj ~ 1, data = orca, dist = "weibull")
fit_wei
Call:
flexsurvreg(formula = su_obj ~ 1, data = orca, dist = "weibull")

Estimates: 
       est    L95%   U95%   se   
shape  0.821  0.737  0.914  0.045
scale  8.426  7.193  9.871  0.680

N = 338,  Events: 229,  Censored: 109
Total time at risk: 1913.673
Log-likelihood = -708.0542, df = 2
AIC = 1420.108
data.frame(summary(fit_wei)) %>% head()
   time       est       lcl       ucl
1 0.085 0.9772491 0.9672428 0.9844084
2 0.162 0.9616852 0.9471458 0.9725004
3 0.167 0.9607367 0.9459484 0.9717434
4 0.170 0.9601706 0.9452349 0.9712906
5 0.246 0.9464460 0.9281915 0.9602414
6 0.249 0.9459254 0.9275537 0.9598219

library(gridExtra)
grid.arrange(
  ggplot(data.frame(summary(fit_wei)), aes(time, est)) + 
    geom_line() + labs(x = "Time (years)", y = "Survival"),
  ggplot(data.frame(summary(fit_wei, type = "hazard")), aes(time, est)) + 
    geom_line() + labs(x = "Time (years)", y = "Hazard"), ncol = 2
)

Comparison of survival curves

su_stg  = survfit(su_obj ~ stage, data = orca)
su_stg
Call: survfit(formula = su_obj ~ stage, data = orca)

            n events median 0.95LCL 0.95UCL
stage=I    50     25  10.56    6.17      NA
stage=II   77     51   7.92    4.92   13.34
stage=III  72     51   7.41    3.92    9.90
stage=IV   68     57   2.00    1.08    4.82
stage=unkn 71     45   3.67    2.83    8.17
survdiff(su_obj ~ stage, data = orca)
Call:
survdiff(formula = su_obj ~ stage, data = orca)

            N Observed Expected (O-E)^2/E (O-E)^2/V
stage=I    50       25     39.9     5.573     6.813
stage=II   77       51     63.9     2.606     3.662
stage=III  72       51     54.1     0.174     0.231
stage=IV   68       57     33.2    16.966    20.103
stage=unkn 71       45     37.9     1.346     1.642

 Chisq= 27.2  on 4 degrees of freedom, p= 1.78e-05 

Change the argument fun = events to compare cumulative proportions, cumhaz for the cumulative hazards, cloglog for the complementary log−log

ggsurvplot(su_stg, fun = "event", censor = F, xlab = "Time (years)")

Modeling survival data (Cox PH model)

m1 = coxph(su_obj ~ sex + I((age-65)/10) + stage, data = orca)
ci.exp(m1)
                 exp(Est.)      2.5%    97.5%
sexMale           1.421036 1.0770928 1.874809
I((age - 65)/10)  1.515929 1.3572535 1.693156
stageII           1.035537 0.6385626 1.679298
stageIII          1.412619 0.8727664 2.286398
stageIV           2.423992 1.5063276 3.900703
stageunkn         1.793932 1.0963386 2.935398
# check proportionallity of the hazard
cox.zph(m1)
                      rho    chisq     p
sexMale          -0.00137 0.000439 0.983
I((age - 65)/10)  0.07539 1.393597 0.238
stageII          -0.04208 0.411652 0.521
stageIII         -0.06915 1.083755 0.298
stageIV          -0.10044 2.301780 0.129
stageunkn        -0.09663 2.082042 0.149
GLOBAL                 NA 4.895492 0.557

newd = data.frame(sex = "Male", age = 40, stage = "IV")
fortify(survfit(m1, newdata = newd)) %>%
  ggplot(aes(x = time, y = surv)) +
  geom_step() + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = .2) +
  labs(x = "Time (years)", y = "Survival probability") + theme_classic()

Get interactive!

library(shiny)
shinyApp(
  ui = fluidPage(
   titlePanel("Prediction from Cox model"),
   sidebarLayout(
      sidebarPanel(
        sliderInput("age", "Select age:", min = 50, max = 75, value = 60),
        selectInput("stage", "Select tumoral stage:", levels(orca$stage), "IV"),
        radioButtons("sex", "Select sex:", levels(orca$sex), "Male")
      ),
      mainPanel(
        plotOutput("surv")
      )
   )),
  server = function(input, output){
   output$surv = renderPlot({
     newd = reactive({
       data.frame(sex = input$sex, age = input$age, stage = input$stage)
     })
     fortify(survfit(m1, newdata = newd())) %>%
       ggplot(aes(x = time, y = surv)) +
       geom_step() + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = .2) +
       labs(x = "Time (years)", y = "Survival probability") + 
       theme_classic()
   })
   }
)

Modeling survival data (AFT model)

m2 = flexsurvreg(su_obj ~ sex + I((age-65)/10) + stage, data = orca, dist = "weibull")
m2
Call:
flexsurvreg(formula = su_obj ~ sex + I((age - 65)/10) + stage, 
    data = orca, dist = "weibull")

Estimates: 
                  data mean  est      L95%     U95%     se       exp(est)  L95%     U95%   
shape                  NA     0.9181   0.8272   1.0191   0.0489       NA        NA       NA
scale                  NA    14.5312   9.1847  22.9898   3.4012       NA        NA       NA
sexMale            0.5503    -0.4276  -0.7283  -0.1269   0.1534   0.6521    0.4827   0.8808
I((age - 65)/10)  -0.1486    -0.4613  -0.5817  -0.3409   0.0614   0.6305    0.5590   0.7111
stageII            0.2278    -0.0144  -0.5397   0.5109   0.2680   0.9857    0.5829   1.6668
stageIII           0.2130    -0.3538  -0.8783   0.1706   0.2676   0.7020    0.4155   1.1860
stageIV            0.2012    -0.9674  -1.4889  -0.4458   0.2661   0.3801    0.2256   0.6403
stageunkn          0.2101    -0.6513  -1.1863  -0.1163   0.2730   0.5214    0.3053   0.8902

N = 338,  Events: 229,  Censored: 109
Total time at risk: 1913.673
Log-likelihood = -662.2349, df = 8
AIC = 1340.47

newd = data_frame(sex = "Male", age = 40, stage = "IV")
data.frame(unname(summary(m2, newdata = newd))) %>%
  ggplot(aes(x = time, y = est)) +
  geom_step() + geom_ribbon(aes(ymin = lcl, ymax = ucl), alpha = .2) +
  labs(x = "Time (years)", y = "Survival probability") + 
  theme_classic()