Survival Analysis

First we need to become familiaw with how to perform the survival function in R. We need to install the required packages survival and dplyr.

install.packages("survival")
install.packages("dplyr")

# Add packages to Library
library("survival")
library("dplyr")

Survival and Hazard Function

Now in our example we will explore the time until customers’ phone contract cancellations. let’s create two vectors. The first vector is for the recorded time period and the second for the event. In this case the time is the period where a customer has cancelled its contract and the event “1” is the event happening (contract cancellation). In this example, eventually all customers have cancelled their contracts.In the following example we will consider also tihe case of censoring data.

# Lecture Simple example
time <- c(3,7,16,25,32,44,49,50,55,72)
event <- c(1,1,1,1,1,1,1,1,1,1)

The Surv() function from the {survival} package creates a survival object that we will use as input in most of the functions in survival analysis. To create a survival object we need two arguments the time and the status indicator (event happened or not).

The Kaplan-Meier method is the most common way to estimate survival times and probabilities. The survfit() function creates survival curves using the Kaplan-Meier method based on a formula. The syntax is the survival object we have created followed by ~1.

# Create a survival object
Ex_Surv <- Surv(time, event)
Ex_survfit <- survfit(Ex_Surv ~ 1)

Kaplan-Meier plot

Given that we have done the survival function we can now plot the Kaplan-Meier plot. One way we can do this is with the plot function.

# Create a Kaplan-Meier plot
plot(Ex_survfit, conf.int="none", mark.time=TRUE, xlab="Time (weeks)", ylab="Proportion Survival")

Censoring

Let’s take the case of censoring data. We will create a slightly different dataset. As previously we have the time of an event happening (event “1”) or the a customer left the study (event “0”). If the event is “1” that means that the event happened to the customer at the time period found in the time vector. If it zero that means that the customer left the program and until that time period we do not actually know if the event have happened to them or not.

# Lecture Simple example 2
time <- c(3,7,16,25,32,44,49,50,55,72)
event <- c(1,0,1,1,0,1,1,0,1,1)
# Create a survival object
Ex_Surv <- Surv(time, event)
Ex_Surv
##  [1]  3   7+ 16  25  32+ 44  49  50+ 55  72

If you call Ex_Surv you will see that for the censored observation the time period has a + indicator. The syntax for the rest steps is as before.

Ex_survfit <- survfit(Ex_Surv ~ 1)

Kaplan-Meier plot with censoring

# Create a Kaplan-Meier plot
plot(Ex_survfit, conf.int=TRUE, mark.time=TRUE, xlab="Time (weeks)", ylab="Proportion Survival")

Log-rank test comparison of survival function.

Let’s see now how we can perform the log-rank test comparison in R. The log-rank test which is used to compare two separate survival curves from two separate population and identify if there is a statistically significant different between two groups.

The null hypothesis for log-rank test is that the two survival function is equal: S1(t) = S2(t). If you can reject the null hypothesis then you can said that there is a different between two functions. However,you can only compare 2 groups, not more! The log-rank test is a form of the Chi-squared test.

For the purpose of the seminar we will use a built in dataset(AML). This is about the survival in patients with Acute Myelogenous Leukemia. In the dataset we have the time, the censored status (“1” event happened “0” censored) and whether maintained chemotherapy was given. Initially we will follow the same steps as before.

# Acute Myelogenous Leukemia survival example - data included in the survival package

data <- data.frame(aml)
head(data)
##   time status          x
## 1    9      1 Maintained
## 2   13      1 Maintained
## 3   13      0 Maintained
## 4   18      1 Maintained
## 5   23      1 Maintained
## 6   28      0 Maintained
AML_Surv <- Surv(aml$time, aml$status)

In the previous example we used the syntax ~1 but now we will compare the two groups that are found in x field (maintained chemotherapy or not). After we run this we will have two survival functions, one for each group. To perform the log-rank test we will use the survdiff() function.

AML_survfit <- survfit(AML_Surv ~ aml$x)
plot(AML_survfit, conf.int=FALSE, mark.time=TRUE, xlab="Time", ylab="Proportion Survival")

survdiff(AML_Surv~x, data=aml)
## Call:
## survdiff(formula = AML_Surv ~ x, data = aml)
## 
##                  N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained    11        7    10.69      1.27       3.4
## x=Nonmaintained 12       11     7.31      1.86       3.4
## 
##  Chisq= 3.4  on 1 degrees of freedom, p= 0.07

Cox regression model of variable that influence survival function.

Cox regression model it can be used to measured association between variable and survivability of a population. The Cox proportional-hazards model is essentially a regression model used to investigate the association between the time to an event of interest to a set of variable that might have an effect on that time. It is also commonly found in medical research for investigating link between survival time of patients and one or more predictor variables. In business, it can be used to investigate link between time to an event of interest such as ‘insurance claim’, ‘purchase’, ‘churn’ against variable such as ‘gender’, ‘type of advertisement that brought customer in’, ‘income’, etc.

This is the example from the lecture. This is a built in dataset from the library boot about patients with malignent melanoma. There are variables about the sex,age, year of operations, thickness or tumor etc.

The first thing is to mutate the status to alive or not alive.

melanoma <- boot::melanoma
melanoma <- melanoma%>%mutate(status=if_else(status == 2, 0, # "still alive"
                        if_else(status == 1, 1, # "died of melanoma"
                               0)))
model <- coxph(Surv(time, status) ~ ., data = melanoma)
summary(model)
## Call:
## coxph(formula = Surv(time, status) ~ ., data = melanoma)
## 
##   n= 205, number of events= 57 
## 
##                coef exp(coef)  se(coef)      z Pr(>|z|)    
## sex        0.448121  1.565368  0.266861  1.679 0.093107 .  
## age        0.016805  1.016947  0.008578  1.959 0.050094 .  
## year      -0.102566  0.902518  0.061007 -1.681 0.092719 .  
## thickness  0.100312  1.105516  0.038212  2.625 0.008660 ** 
## ulcer      1.194555  3.302087  0.309254  3.863 0.000112 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## sex          1.5654     0.6388    0.9278     2.641
## age          1.0169     0.9833    1.0000     1.034
## year         0.9025     1.1080    0.8008     1.017
## thickness    1.1055     0.9046    1.0257     1.191
## ulcer        3.3021     0.3028    1.8012     6.054
## 
## Concordance= 0.757  (se = 0.031 )
## Likelihood ratio test= 44.4  on 5 df,   p=2e-08
## Wald test            = 40.89  on 5 df,   p=1e-07
## Score (logrank) test = 48.14  on 5 df,   p=3e-09

The function of cox regression model is coxph. The output is similar to the regression and we can understand which variables are significant.

model <- coxph(Surv(time, status) ~ ., data = melanoma)
summary(model)
## Call:
## coxph(formula = Surv(time, status) ~ ., data = melanoma)
## 
##   n= 205, number of events= 57 
## 
##                coef exp(coef)  se(coef)      z Pr(>|z|)    
## sex        0.448121  1.565368  0.266861  1.679 0.093107 .  
## age        0.016805  1.016947  0.008578  1.959 0.050094 .  
## year      -0.102566  0.902518  0.061007 -1.681 0.092719 .  
## thickness  0.100312  1.105516  0.038212  2.625 0.008660 ** 
## ulcer      1.194555  3.302087  0.309254  3.863 0.000112 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## sex          1.5654     0.6388    0.9278     2.641
## age          1.0169     0.9833    1.0000     1.034
## year         0.9025     1.1080    0.8008     1.017
## thickness    1.1055     0.9046    1.0257     1.191
## ulcer        3.3021     0.3028    1.8012     6.054
## 
## Concordance= 0.757  (se = 0.031 )
## Likelihood ratio test= 44.4  on 5 df,   p=2e-08
## Wald test            = 40.89  on 5 df,   p=1e-07
## Score (logrank) test = 48.14  on 5 df,   p=3e-09

Surviving the Chart: Analysis of Musician Longevity in UK Top 100 Chart in R

In this case, we will be using a real dataset of the maximum time period that a music pieces remains on Spotify UK Top 200 Charts (https://spotifycharts.com/regional/gb/daily/latest) from 2017 and 2018. The reason that we are interested in this is because entertainment company like film and music label has a keen interest on having a products on them. It is considered to be a clear criteria of a successful products.

Data: Overview and Preperation

Definition and Overview

To import the data into R, put the four CSV files into your R working directory then use the following

Spotify = read.csv("Spotify_Chart.csv")
Columns Description
URL Spotify URL
Track.Name Song name as appeared on Spotify
Artist Artist name as appeared on Spotify
time Maximum length on Charts
delta If the event of the song leaving the chart happened after last day of the year
Year Year on Chart

You can then explore the data using head() and summary()

head(Spotify)
##                                                     URL
## 1 https://open.spotify.com/track/000xQL6tZNLJzIrtIgxqSl
## 2 https://open.spotify.com/track/007d7JT41sSc1HqWTs4uw7
## 3 https://open.spotify.com/track/00l1uBtEO4WwmsfxqbeTWP
## 4 https://open.spotify.com/track/00lNx0OcTJrS3MKHcB80HY
## 5 https://open.spotify.com/track/0181HMomm7xM3Ks5YNlA9X
## 6 https://open.spotify.com/track/01uk7IzZyFNfQTDBXxx6NB
##                          Track.Name           Artist time delta Year
## 1                    Still Got Time             ZAYN  121     1 2017
## 2                       Love Myself Hailee Steinfeld    1     1 2017
## 3 Destination Calabria - Radio Edit     Alex Gaudino    1     1 2017
## 4    You Don't Know Me - Radio Edit        Jax Jones  110     0 2017
## 5           OMG (Young Money Remix)        Jae Millz    1     1 2017
## 6                     One Last Song        Sam Smith   51     0 2017
summary(Spotify)
##      URL             Track.Name           Artist               time    
##  Length:4855        Length:4855        Length:4855        Min.   :  1  
##  Class :character   Class :character   Class :character   1st Qu.:  1  
##  Mode  :character   Mode  :character   Mode  :character   Median :  2  
##                                                           Mean   : 21  
##                                                           3rd Qu.: 19  
##                                                           Max.   :364  
##      delta             Year     
##  Min.   :0.0000   Min.   :2017  
##  1st Qu.:1.0000   1st Qu.:2017  
##  Median :1.0000   Median :2018  
##  Mean   :0.9176   Mean   :2018  
##  3rd Qu.:1.0000   3rd Qu.:2018  
##  Max.   :1.0000   Max.   :2018

If you are interested, maybe try using filter() and summarize() in ‘dplyr’ to see which artists has the most amount of songs in the top 200 charts in 2017 and 2018.

Survival Function and Hazard Function

Firstly, using the ‘dplyr’ package, let’s extract the data on just 2017 first using:

Spotify_2017 <- Spotify %>% filter(Year == 2017)

We can then create a survival object with this data using:

Spotify_Surv <- Surv(Spotify_2017$time, Spotify_2017$delta)

From this survival object, we can create a Kaplan-Meier plot using:

Spotify_Survfit <- survfit(Spotify_Surv ~ 1)
plot(Spotify_Survfit, conf.int="none", mark.time=TRUE, xlab="Time (days)", ylab="Proportion
Survival")

We can view the stats on the survival object using:

Spotify_Survfit
## Call: survfit(formula = Spotify_Surv ~ 1)
## 
##         n events median 0.95LCL 0.95UCL
## [1,] 2372   2172      1       1       2
head(summary(Spotify_Survfit))
## $n
## [1] 2372
## 
## $time
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  59  60  61  62  63  64  66  68  69  70  71  72  73  74  75
##  [73]  76  77  78  79  80  82  83  84  86  87  88  90  92  93  96  97  98  99
##  [91] 100 102 109 110 111 114 116 117 118 121 128 131 135 137 144 145 146 149
## [109] 153 154 155 156 170 175 177 178
## 
## $n.risk
##   [1] 2372 1140 1060  968  897  837  767  725  698  683  672  656  635  602  593
##  [16]  575  570  555  548  546  528  519  512  498  494  472  456  442  423  367
##  [31]  359  356  349  345  343  339  333  326  324  318  315  304  302  298  296
##  [46]  292  290  284  277  275  273  270  266  263  256  252  246  243  236  235
##  [61]  231  227  223  221  218  213  212  204  200  195  193  192  191  187  183
##  [76]  179  173  170  167  165  164  161  156  128  125  123  118  113  110  109
##  [91]  108  105  101  100   97   96   91   90   87   84   75   72   68   65   62
## [106]   61   60   55   51   50   49   47   44   39   37   31
## 
## $n.event
##   [1] 1197   79   85   68   53   68   41   26   14   11   15   19   31    7   17
##  [16]    3   15    7    2   18    9    4   13    4   21   16   13   19   53    5
##  [31]    3    6    4    2    3    5    1    2    6    2   10    2    4    2    4
##  [46]    2    4    5    2    2    1    4    3    6    3    5    1    7    1    2
##  [61]    3    4    1    1    2    1    7    3    2    2    1    1    4    2    2
##  [76]    5    2    3    2    1    2    5   22    1    2    2    5    2    1    1
##  [91]    2    2    1    2    1    4    1    2    2    5    2    2    1    1    1
## [106]    1    1    1    1    1    1    1    3    1    2    1
## 
## $n.censor
##   [1] 35  1  7  3  7  2  1  1  1  0  1  2  2  2  1  2  0  0  0  0  0  3  1  0  1
##  [26]  0  1  0  3  3  0  1  0  0  1  1  6  0  0  1  1  0  0  0  0  0  2  2  0  0
##  [51]  2  0  0  1  1  1  0  2  0  2  1  0  0  3  3  0  1  1  3  0  0  0  0  2  2
##  [76]  1  0  1  0  0  1  0  5  2  1  2  1  1  0  0  1  0  2  1  0  0  1  1  0  4
## [101]  2  0  4  0  2  0  1  5  1  0  1  1  3  0  5  0
## 
## $surv
##   [1] 0.49536256 0.46103481 0.42406503 0.39427534 0.37097925 0.34083996
##   [7] 0.32262035 0.31105052 0.30481168 0.29990256 0.29320831 0.28471600
##  [13] 0.27081647 0.26766745 0.25999401 0.25863752 0.25183127 0.24865502
##  [19] 0.24774752 0.23958002 0.23549627 0.23368127 0.22774796 0.22591865
##  [25] 0.21631482 0.20898212 0.20302430 0.19429701 0.16995246 0.16763704
##  [31] 0.16623617 0.16343444 0.16156126 0.16062467 0.15921979 0.15687142
##  [37] 0.15640033 0.15544082 0.15256229 0.15160278 0.14678999 0.14582427
##  [43] 0.14389282 0.14292710 0.14099565 0.14002993 0.13809848 0.13566717
##  [49] 0.13468762 0.13370808 0.13321830 0.13124470 0.12976449 0.12680409
##  [55] 0.12531810 0.12283163 0.12233232 0.11880834 0.11830491 0.11729806
##  [61] 0.11577471 0.11373463 0.11322461 0.11271228 0.11167822 0.11115391
##  [67] 0.10748374 0.10590309 0.10484406 0.10376874 0.10323108 0.10269341
##  [73] 0.10054277 0.09946744 0.09838037 0.09563231 0.09452674 0.09285862
##  [79] 0.09174654 0.09119050 0.09007842 0.08728095 0.07497210 0.07438638
##  [85] 0.07319620 0.07200602 0.06895491 0.06773447 0.06711871 0.06650294
##  [91] 0.06527140 0.06402814 0.06339419 0.06212631 0.06148583 0.05892392
##  [97] 0.05827641 0.05698138 0.05567146 0.05235768 0.05096148 0.04954588
## [103] 0.04881727 0.04806623 0.04729097 0.04651571 0.04574045 0.04490880
## [109] 0.04402824 0.04314767 0.04226711 0.04136781 0.03854727 0.03755888
## [115] 0.03552867 0.03438259

As you can see, the median suggests 50% of songs does not last longer than 1 day on the chart and the survival rate after one day is less than 50%, which is not surprising given the competitive nature within music industry.

However, knowing the survival rate of music on a chart for a single year doesn’t give you a huge understanding of what happened or see how it differs from others. It would be more interesting to looks at the chart from the subsequent year to see if survival rate of songs on UK Top 200 Chart has change at all.

Challenge 1

Additional Tasks:

Comparision of Survival Function

Now let’s compare the survival rates of the two different years. Firstly, we can plot both survival curves on the one Kaplan-Meier plot using:

compare <- survfit(Surv(time = time, event = delta) ~ Year, data = Spotify)
plot(compare, xlab="Time (days)", ylab="Proportion Survival",col= c('red','blue') , mark.time=TRUE)
legend("topright", c("2017", "2018"), lty = 1 ,col= c('red','blue'))

You can take a look at ?plot.survfit for help with arguments for the plot. Now we can compare the survival curves using the log-rank test with:

survdiff(Surv(time = time, event = delta) ~ Year, data = Spotify)
## Call:
## survdiff(formula = Surv(time = time, event = delta) ~ Year, data = Spotify)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## Year=2017 2372     2172     2075      4.51      11.6
## Year=2018 2483     2283     2380      3.94      11.6
## 
##  Chisq= 11.6  on 1 degrees of freedom, p= 7e-04

Are the survival curves significantly different between 2017 and 2018?

Challenge 2

Now that you have used survival analysis at year level, try using survival analysis on the same dataset, but for different artists for different years. Some of the artists you can make comparision between are listed below:

Example 1:

Spotify_2017_Sheeran <- Spotify %>% filter(Artist == 'Ed Sheeran' & Year == 2017)
Spotify_Surv_Sheeran2017 <- Surv(Spotify_2017_Sheeran$time, Spotify_2017_Sheeran$delta)
Spotify_Surv_Sheeran2017
##  [1]  12  208+  13    9  208+  30+   4  121+  25   82   88  170  200+  87   25 
## [16]  13   87    5    1    1   12   87    6  100    6    1   99    6    9    1 
## [31]  18    1  114   25  208+  88   87   76    9   25   82  208+
Spotify_Survfit_Sheeran2017 <- survfit(Spotify_Surv_Sheeran2017 ~ 1)
plot(Spotify_Survfit_Sheeran2017,conf.int="none",mark.time=TRUE,xlab="Time(days)",
ylab="Proportion Survival")
title("A study of ‘incredible longevity’ of a generic pop music in chart.")

Example 2:

Spotify_Drake <- Spotify %>% filter(Artist == 'Drake')
compare <- survfit(Surv(time = time, event = delta) ~ Year, data = Spotify_Drake)
plot(compare, xlab="Time (days)", ylab="Proportion Survival",col= c('red','blue') , mark.time=TRUE)
title("Drake Fight V17/18")
legend("topright", c("2017", "2018"), lty = 1 ,col= c('red','blue'))