First we need to become familiar with how to perform survival analysis in R. We need to install the required packages, survival and dplyr.
install.packages("survival")
install.packages("dplyr")
# Load the packages
library("survival")
library("dplyr")
Now, in our example, we will explore the time until customers cancel their phone contracts. Let's create two vectors: the first for the recorded time period and the second for the event. Here the time is the period after which a customer cancelled their contract, and the event "1" means the event (contract cancellation) happened. In this example, all customers eventually cancelled their contracts. In the following example we will also consider the case of censored data.
# Lecture Simple example
time <- c(3,7,16,25,32,44,49,50,55,72)
event <- c(1,1,1,1,1,1,1,1,1,1)
The Surv() function from the {survival} package creates a survival object, which we will use as input to most of the functions in survival analysis. To create a survival object we need two arguments: the time and the status indicator (whether the event happened or not).
The Kaplan-Meier method is the most common way to estimate survival times and probabilities. The survfit() function creates survival curves using the Kaplan-Meier method based on a formula. The syntax is the survival object we have created followed by ~1.
# Create a survival object
Ex_Surv <- Surv(time, event)
Ex_survfit <- survfit(Ex_Surv ~ 1)
Now that we have estimated the survival function, we can draw the Kaplan-Meier plot. One way to do this is with the plot() function.
# Create a Kaplan-Meier plot
plot(Ex_survfit, conf.int="none", mark.time=TRUE, xlab="Time (weeks)", ylab="Proportion Survival")
Let’s take the case of censored data. We will create a slightly different dataset. As before, we have the time at which an event happened (event “1”) or at which a customer left the study (event “0”). If the event is “1”, the event happened to the customer at the time given in the time vector. If it is “0”, the customer left the program and up to that time period we do not actually know whether the event happened to them or not.
# Lecture Simple example 2
time <- c(3,7,16,25,32,44,49,50,55,72)
event <- c(1,0,1,1,0,1,1,0,1,1)
# Create a survival object
Ex_Surv <- Surv(time, event)
Ex_Surv
## [1] 3 7+ 16 25 32+ 44 49 50+ 55 72
If you print Ex_Surv you will see that censored observations have a + indicator after the time period. The syntax for the remaining steps is the same as before.
Ex_survfit <- survfit(Ex_Surv ~ 1)
# Create a Kaplan-Meier plot
plot(Ex_survfit, conf.int=TRUE, mark.time=TRUE, xlab="Time (weeks)", ylab="Proportion Survival")
Let’s now see how we can perform the log-rank test in R. The log-rank test is used to compare the survival curves of two separate populations and to identify whether there is a statistically significant difference between the two groups.
The null hypothesis of the log-rank test is that the two survival functions are equal: S1(t) = S2(t). If you can reject the null hypothesis, then you can say that there is a difference between the two functions. In its basic form the test compares two groups at a time. The log-rank test is a form of the Chi-squared test.
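To see the syntax in isolation before the real example, here is a minimal sketch that reuses the time and event vectors from the censoring example above together with a hypothetical group vector, invented purely for illustration:
# Hypothetical split of the ten customers into two groups (for illustration only)
group <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")
# Log-rank test comparing the survival curves of the two made-up groups
survdiff(Surv(time, event) ~ group)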
For the purpose of the seminar we will use a built-in dataset (aml) about survival in patients with Acute Myelogenous Leukemia. In the dataset we have the time, the censoring status (“1” = event happened, “0” = censored), and whether maintenance chemotherapy was given. Initially we will follow the same steps as before.
# Acute Myelogenous Leukemia survival example - data included in the survival package
data <- data.frame(aml)
head(data)
## time status x
## 1 9 1 Maintained
## 2 13 1 Maintained
## 3 13 0 Maintained
## 4 18 1 Maintained
## 5 23 1 Maintained
## 6 28 0 Maintained
AML_Surv <- Surv(aml$time, aml$status)
In the previous examples we used the syntax ~ 1, but now we will compare the two groups found in the x field (maintenance chemotherapy or not). After we run this we will have two survival curves, one for each group. To perform the log-rank test we will use the survdiff() function.
AML_survfit <- survfit(AML_Surv ~ aml$x)
plot(AML_survfit, conf.int=FALSE, mark.time=TRUE, xlab="Time", ylab="Proportion Survival")
survdiff(AML_Surv~x, data=aml)
## Call:
## survdiff(formula = AML_Surv ~ x, data = aml)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained 11 7 10.69 1.27 3.4
## x=Nonmaintained 12 11 7.31 1.86 3.4
##
## Chisq= 3.4 on 1 degrees of freedom, p= 0.07
The Cox regression model can be used to measure the association between variables and the survivability of a population. The Cox proportional-hazards model is essentially a regression model used to investigate the association between the time to an event of interest and a set of variables that might have an effect on that time. It is commonly found in medical research for investigating the link between the survival time of patients and one or more predictor variables. In business, it can be used to investigate the link between the time to an event of interest such as ‘insurance claim’, ‘purchase’ or ‘churn’ and variables such as ‘gender’, ‘type of advertisement that brought the customer in’, ‘income’, etc.
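As a purely hypothetical sketch of such a business application (the churn_df data frame and every value in it are invented for illustration and are not part of the lecture material):
# Hypothetical customer-churn data, invented for illustration
churn_df <- data.frame(
  time   = c(5, 12, 20, 7, 30, 18, 9, 25),    # weeks until churn or censoring
  churn  = c(1, 1, 0, 1, 0, 1, 1, 0),         # 1 = churned, 0 = censored
  gender = c("F", "M", "F", "M", "F", "M", "F", "M"),
  income = c(32, 54, 41, 38, 60, 45, 29, 52)  # annual income in thousands
)
# Cox model of time to churn against gender and income
coxph(Surv(time, churn) ~ gender + income, data = churn_df)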
This is the example from the lecture. It uses a built-in dataset from the boot package about patients with malignant melanoma. There are variables for sex, age, year of operation, tumour thickness, etc.
The first thing is to recode the status variable into an event indicator: 1 if the patient died of melanoma and 0 otherwise (still alive, or died of other causes; both are treated as censored).
melanoma <- boot::melanoma
melanoma <- melanoma %>%
  mutate(status = if_else(status == 2, 0,  # "still alive" -> censored
                  if_else(status == 1, 1,  # "died of melanoma" -> event
                          0)))             # died of other causes -> censored
The function for the Cox regression model is coxph(). The output is similar to that of an ordinary regression, and we can see which variables are significant.
model <- coxph(Surv(time, status) ~ ., data = melanoma)
summary(model)
## Call:
## coxph(formula = Surv(time, status) ~ ., data = melanoma)
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex 0.448121 1.565368 0.266861 1.679 0.093107 .
## age 0.016805 1.016947 0.008578 1.959 0.050094 .
## year -0.102566 0.902518 0.061007 -1.681 0.092719 .
## thickness 0.100312 1.105516 0.038212 2.625 0.008660 **
## ulcer 1.194555 3.302087 0.309254 3.863 0.000112 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.5654 0.6388 0.9278 2.641
## age 1.0169 0.9833 1.0000 1.034
## year 0.9025 1.1080 0.8008 1.017
## thickness 1.1055 0.9046 1.0257 1.191
## ulcer 3.3021 0.3028 1.8012 6.054
##
## Concordance= 0.757 (se = 0.031 )
## Likelihood ratio test= 44.4 on 5 df, p=2e-08
## Wald test = 40.89 on 5 df, p=1e-07
## Score (logrank) test = 48.14 on 5 df, p=3e-09
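The exp(coef) column already reports the hazard ratios. If you want to pull them out of the fitted model yourself, the generic coef() and confint() functions work on coxph objects, so something like the following should reproduce those columns:
# Hazard ratios and their 95% confidence intervals from the fitted Cox model
exp(coef(model))
exp(confint(model))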
In this case, we will be using a real dataset of the maximum time period that a piece of music remains on the Spotify UK Top 200 Chart (https://spotifycharts.com/regional/gb/daily/latest) in 2017 and 2018. We are interested in this because entertainment companies such as film studios and music labels have a keen interest in getting their products onto such charts; it is considered a clear criterion of a successful product.
To import the data into R, put the CSV file (Spotify_Chart.csv) into your R working directory and then use the following:
Spotify = read.csv("Spotify_Chart.csv")
Column | Description
---|---
URL | Spotify URL
Track.Name | Song name as it appears on Spotify
Artist | Artist name as it appears on Spotify
time | Maximum number of days the song stayed on the chart
delta | Event indicator: 1 if the song left the chart by the last day of the year, 0 if it was still on the chart at the end of the year (censored)
Year | Year of the chart
You can then explore the data using head() and summary()
head(Spotify)
## URL
## 1 https://open.spotify.com/track/000xQL6tZNLJzIrtIgxqSl
## 2 https://open.spotify.com/track/007d7JT41sSc1HqWTs4uw7
## 3 https://open.spotify.com/track/00l1uBtEO4WwmsfxqbeTWP
## 4 https://open.spotify.com/track/00lNx0OcTJrS3MKHcB80HY
## 5 https://open.spotify.com/track/0181HMomm7xM3Ks5YNlA9X
## 6 https://open.spotify.com/track/01uk7IzZyFNfQTDBXxx6NB
## Track.Name Artist time delta Year
## 1 Still Got Time ZAYN 121 1 2017
## 2 Love Myself Hailee Steinfeld 1 1 2017
## 3 Destination Calabria - Radio Edit Alex Gaudino 1 1 2017
## 4 You Don't Know Me - Radio Edit Jax Jones 110 0 2017
## 5 OMG (Young Money Remix) Jae Millz 1 1 2017
## 6 One Last Song Sam Smith 51 0 2017
summary(Spotify)
## URL Track.Name Artist time
## Length:4855 Length:4855 Length:4855 Min. : 1
## Class :character Class :character Class :character 1st Qu.: 1
## Mode :character Mode :character Mode :character Median : 2
## Mean : 21
## 3rd Qu.: 19
## Max. :364
## delta Year
## Min. :0.0000 Min. :2017
## 1st Qu.:1.0000 1st Qu.:2017
## Median :1.0000 Median :2018
## Mean :0.9176 Mean :2018
## 3rd Qu.:1.0000 3rd Qu.:2018
## Max. :1.0000 Max. :2018
If you are interested, try using filter() and summarize() from ‘dplyr’ to see which artists had the most songs in the Top 200 chart in 2017 and 2018; one possible sketch is shown below.
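This sketch assumes only the column names listed in the table above (Artist, Track.Name, Year); n_distinct() counts each artist's unique track names:
# One way to count each artist's distinct songs on the 2017 chart
Spotify %>%
  filter(Year == 2017) %>%
  group_by(Artist) %>%
  summarise(n_songs = n_distinct(Track.Name)) %>%
  arrange(desc(n_songs)) %>%
  head(10)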
Firstly, using the ‘dplyr’ package, let’s extract just the 2017 data using:
Spotify_2017 <- Spotify %>% filter(Year == 2017)
We can then create a survival object with this data using:
Spotify_Surv <- Surv(Spotify_2017$time, Spotify_2017$delta)
From this survival object, we can create a Kaplan-Meier plot using:
Spotify_Survfit <- survfit(Spotify_Surv ~ 1)
plot(Spotify_Survfit, conf.int="none", mark.time=TRUE, xlab="Time (days)", ylab="Proportion Survival")
We can view the stats on the survival object using:
Spotify_Survfit
## Call: survfit(formula = Spotify_Surv ~ 1)
##
## n events median 0.95LCL 0.95UCL
## [1,] 2372 2172 1 1 2
head(summary(Spotify_Survfit))
## $n
## [1] 2372
##
## $time
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 59 60 61 62 63 64 66 68 69 70 71 72 73 74 75
## [73] 76 77 78 79 80 82 83 84 86 87 88 90 92 93 96 97 98 99
## [91] 100 102 109 110 111 114 116 117 118 121 128 131 135 137 144 145 146 149
## [109] 153 154 155 156 170 175 177 178
##
## $n.risk
## [1] 2372 1140 1060 968 897 837 767 725 698 683 672 656 635 602 593
## [16] 575 570 555 548 546 528 519 512 498 494 472 456 442 423 367
## [31] 359 356 349 345 343 339 333 326 324 318 315 304 302 298 296
## [46] 292 290 284 277 275 273 270 266 263 256 252 246 243 236 235
## [61] 231 227 223 221 218 213 212 204 200 195 193 192 191 187 183
## [76] 179 173 170 167 165 164 161 156 128 125 123 118 113 110 109
## [91] 108 105 101 100 97 96 91 90 87 84 75 72 68 65 62
## [106] 61 60 55 51 50 49 47 44 39 37 31
##
## $n.event
## [1] 1197 79 85 68 53 68 41 26 14 11 15 19 31 7 17
## [16] 3 15 7 2 18 9 4 13 4 21 16 13 19 53 5
## [31] 3 6 4 2 3 5 1 2 6 2 10 2 4 2 4
## [46] 2 4 5 2 2 1 4 3 6 3 5 1 7 1 2
## [61] 3 4 1 1 2 1 7 3 2 2 1 1 4 2 2
## [76] 5 2 3 2 1 2 5 22 1 2 2 5 2 1 1
## [91] 2 2 1 2 1 4 1 2 2 5 2 2 1 1 1
## [106] 1 1 1 1 1 1 1 3 1 2 1
##
## $n.censor
## [1] 35 1 7 3 7 2 1 1 1 0 1 2 2 2 1 2 0 0 0 0 0 3 1 0 1
## [26] 0 1 0 3 3 0 1 0 0 1 1 6 0 0 1 1 0 0 0 0 0 2 2 0 0
## [51] 2 0 0 1 1 1 0 2 0 2 1 0 0 3 3 0 1 1 3 0 0 0 0 2 2
## [76] 1 0 1 0 0 1 0 5 2 1 2 1 1 0 0 1 0 2 1 0 0 1 1 0 4
## [101] 2 0 4 0 2 0 1 5 1 0 1 1 3 0 5 0
##
## $surv
## [1] 0.49536256 0.46103481 0.42406503 0.39427534 0.37097925 0.34083996
## [7] 0.32262035 0.31105052 0.30481168 0.29990256 0.29320831 0.28471600
## [13] 0.27081647 0.26766745 0.25999401 0.25863752 0.25183127 0.24865502
## [19] 0.24774752 0.23958002 0.23549627 0.23368127 0.22774796 0.22591865
## [25] 0.21631482 0.20898212 0.20302430 0.19429701 0.16995246 0.16763704
## [31] 0.16623617 0.16343444 0.16156126 0.16062467 0.15921979 0.15687142
## [37] 0.15640033 0.15544082 0.15256229 0.15160278 0.14678999 0.14582427
## [43] 0.14389282 0.14292710 0.14099565 0.14002993 0.13809848 0.13566717
## [49] 0.13468762 0.13370808 0.13321830 0.13124470 0.12976449 0.12680409
## [55] 0.12531810 0.12283163 0.12233232 0.11880834 0.11830491 0.11729806
## [61] 0.11577471 0.11373463 0.11322461 0.11271228 0.11167822 0.11115391
## [67] 0.10748374 0.10590309 0.10484406 0.10376874 0.10323108 0.10269341
## [73] 0.10054277 0.09946744 0.09838037 0.09563231 0.09452674 0.09285862
## [79] 0.09174654 0.09119050 0.09007842 0.08728095 0.07497210 0.07438638
## [85] 0.07319620 0.07200602 0.06895491 0.06773447 0.06711871 0.06650294
## [91] 0.06527140 0.06402814 0.06339419 0.06212631 0.06148583 0.05892392
## [97] 0.05827641 0.05698138 0.05567146 0.05235768 0.05096148 0.04954588
## [103] 0.04881727 0.04806623 0.04729097 0.04651571 0.04574045 0.04490880
## [109] 0.04402824 0.04314767 0.04226711 0.04136781 0.03854727 0.03755888
## [115] 0.03552867 0.03438259
As you can see, the median suggests that 50% of songs do not last longer than 1 day on the chart, and the survival rate after one day is less than 50%, which is not surprising given the competitive nature of the music industry.
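If you would rather read off the estimated survival probabilities at specific time points than scan the full table, summary() on a survfit object accepts a times argument; for example (the chosen time points here are arbitrary):
# Estimated proportion of songs still on the chart after 1, 7, 30 and 100 days
summary(Spotify_Survfit, times = c(1, 7, 30, 100))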
However, knowing the survival rate of music on the chart for a single year doesn’t give you much understanding of what happened or how it differs from other years. It would be more interesting to look at the chart from the subsequent year to see whether the survival rate of songs on the UK Top 200 Chart has changed at all.
Now let’s compare the survival rates of the two years. Firstly, we can plot both survival curves on a single Kaplan-Meier plot using:
compare <- survfit(Surv(time = time, event = delta) ~ Year, data = Spotify)
plot(compare, xlab="Time (days)", ylab="Proportion Survival",col= c('red','blue') , mark.time=TRUE)
legend("topright", c("2017", "2018"), lty = 1 ,col= c('red','blue'))
You can take a look at ?plot.survfit for help with arguments for the plot. Now we can compare the survival curves using the log-rank test with:
survdiff(Surv(time = time, event = delta) ~ Year, data = Spotify)
## Call:
## survdiff(formula = Surv(time = time, event = delta) ~ Year, data = Spotify)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## Year=2017 2372 2172 2075 4.51 11.6
## Year=2018 2483 2283 2380 3.94 11.6
##
## Chisq= 11.6 on 1 degrees of freedom, p= 7e-04
Are the survival curves significantly different between 2017 and 2018?
Now that you have used survival analysis at the year level, try using it on the same dataset for different artists in different years. Some artists you can compare are shown in the examples below:
Example 1:
Spotify_2017_Sheeran <- Spotify %>% filter(Artist == 'Ed Sheeran' & Year == 2017)
Spotify_Surv_Sheeran2017 <- Surv(Spotify_2017_Sheeran$time, Spotify_2017_Sheeran$delta)
Spotify_Surv_Sheeran2017
## [1] 12 208+ 13 9 208+ 30+ 4 121+ 25 82 88 170 200+ 87 25
## [16] 13 87 5 1 1 12 87 6 100 6 1 99 6 9 1
## [31] 18 1 114 25 208+ 88 87 76 9 25 82 208+
Spotify_Survfit_Sheeran2017 <- survfit(Spotify_Surv_Sheeran2017 ~ 1)
plot(Spotify_Survfit_Sheeran2017, conf.int="none", mark.time=TRUE, xlab="Time (days)", ylab="Proportion Survival")
title("A study of the 'incredible longevity' of generic pop music on the chart.")
Example 2:
Spotify_Drake <- Spotify %>% filter(Artist == 'Drake')
compare <- survfit(Surv(time = time, event = delta) ~ Year, data = Spotify_Drake)
plot(compare, xlab="Time (days)", ylab="Proportion Survival",col= c('red','blue') , mark.time=TRUE)
title("Drake Fight V17/18")
legend("topright", c("2017", "2018"), lty = 1 ,col= c('red','blue'))