datos <- read.csv("churn.csv")knitr::kable(summary(datos))| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No :4293 | Min. : 1.0 | no :4527 | no :3677 | Min. : 0.000 | Min. : 0.0 | Min. : 0 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.00 | Min. : 0.0 | Min. : 0.00 | Min. : 0.000 | Min. : 0.00 | Min. : 0.000 | Min. :0.000 | Min. :0.00 | |
| Yes: 707 | 1st Qu.: 73.0 | yes: 473 | yes:1323 | 1st Qu.: 0.000 | 1st Qu.:143.7 | 1st Qu.: 87 | 1st Qu.:24.43 | 1st Qu.:166.4 | 1st Qu.: 87.0 | 1st Qu.:14.14 | 1st Qu.:166.9 | 1st Qu.: 87.00 | 1st Qu.: 7.510 | 1st Qu.: 8.50 | 1st Qu.: 3.000 | 1st Qu.:2.300 | 1st Qu.:1.00 | |
| NA | Median :100.0 | NA | NA | Median : 0.000 | Median :180.1 | Median :100 | Median :30.62 | Median :201.0 | Median :100.0 | Median :17.09 | Median :200.4 | Median :100.00 | Median : 9.020 | Median :10.30 | Median : 4.000 | Median :2.780 | Median :1.00 | |
| NA | Mean :100.3 | NA | NA | Mean : 7.755 | Mean :180.3 | Mean :100 | Mean :30.65 | Mean :200.6 | Mean :100.2 | Mean :17.05 | Mean :200.4 | Mean : 99.92 | Mean : 9.018 | Mean :10.26 | Mean : 4.435 | Mean :2.771 | Mean :1.57 | |
| NA | 3rd Qu.:127.0 | NA | NA | 3rd Qu.:17.000 | 3rd Qu.:216.2 | 3rd Qu.:113 | 3rd Qu.:36.75 | 3rd Qu.:234.1 | 3rd Qu.:114.0 | 3rd Qu.:19.90 | 3rd Qu.:234.7 | 3rd Qu.:113.00 | 3rd Qu.:10.560 | 3rd Qu.:12.00 | 3rd Qu.: 6.000 | 3rd Qu.:3.240 | 3rd Qu.:2.00 | |
| NA | Max. :243.0 | NA | NA | Max. :52.000 | Max. :351.5 | Max. :165 | Max. :59.76 | Max. :363.7 | Max. :170.0 | Max. :30.91 | Max. :395.0 | Max. :175.00 | Max. :17.770 | Max. :20.00 | Max. :20.000 | Max. :5.400 | Max. :9.00 |
plot(datos$churn)plot(density(datos$accountlength))plot(density(datos$totaldayminutes))plot(density(datos$totalnightminutes))datos$churn <- ifelse(datos$churn == "Yes", 1, 0)
summary(datos$churn)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1414 0.0000 1.0000
La fórmula Surv(accountlength, churn) ~ 1 indica que la población únicamente debe dividirse en 1 estrato.
require(survival)
require(survminer)
ajuste <- surv_fit(Surv(accountlength, churn) ~ 1, data = datos)ajuste?str(ajuste)## List of 13
## $ n : int 5000
## $ time : num [1:218] 1 2 3 4 5 6 7 8 9 10 ...
## $ n.risk : num [1:218] 5000 4989 4987 4979 4976 ...
## $ n.event : num [1:218] 1 1 1 1 0 0 0 0 0 0 ...
## $ n.censor : num [1:218] 10 1 7 2 2 2 5 2 3 3 ...
## $ surv : num [1:218] 1 1 0.999 0.999 0.999 ...
## $ type : chr "right"
## $ std.err : num [1:218] 0.0002 0.000283 0.000347 0.000401 0.000401 ...
## $ upper : num [1:218] 1 1 1 1 1 ...
## $ lower : num [1:218] 0.999 0.999 0.999 0.998 0.998 ...
## $ conf.type: chr "log"
## $ conf.int : num 0.95
## $ call : language survfit(formula = Surv(accountlength, churn) ~ 1, data = list(churn = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0| __truncated__ ...
## - attr(*, "class")= chr "survfit"
plot(ajuste)p <- ggsurvplot(ajuste)
plotly::ggplotly(p$plot)ggsurvplot()ggsurvplot(ajuste, fun = "cumhaz", risk.table = TRUE)Usamos una variable que es ella misma dicotómica.
ajuste2 <- surv_fit(Surv(accountlength, churn) ~ voicemailplan, data = datos)p <- ggsurvplot(ajuste2)
plotly::ggplotly(p$plot)También podemos usar las variables continuas y hacerlas dicotómicas. Aquà usamos la media para dividir a nuestra población.
di <- ifelse(datos$totaldayminutes < 180.3, 0,1)
ajuste3 <- surv_fit(Surv(accountlength, churn) ~ di, data = datos)p <- ggsurvplot(ajuste3)
plotly::ggplotly(p$plot)¿Se puede estratificar en más de dos categorÃas? Claro. Todo es posible.
Categorizamos la variable numbercustomerservicecalls usando dplyr.
require(dplyr)
datos2 <- datos%>%
mutate(categ = cut(numbercustomerservicecalls, breaks=c(-1, 2, 5, 10), labels=c("low","midd","high")))ajuste4 <- surv_fit(Surv(accountlength, churn) ~ categ, data = datos2)p <- ggsurvplot(ajuste4)
plotly::ggplotly(p$plot)