Classification - Survival Blend

Approach

Need to predict a time-based event, most common models, whether regression, classification or survival, can get you there but the quality, type of answer, and path taken will vary. A regression model will return a time estimate, a classification model will return the probability of an event at x time, and a survival model will return probabilities of an event not happening over various time frames. We’ll skip the regression model here as we’re not only interested in the time estimate but also the probability of an outcome. With a regression model you would have to first model the outcome and then figure out the time estimate.

Instead, use a survival model (ranger: A Fast Implementation of Random Forests) that will give us an outcome probability over a time continuum (flipping the non-event to event probability), and a classification model (gbm: Generalized Boosted Regression Models), where we’ll measure the probability of the same event happening within x periods. Then look at two ways of ensembling the models and hope for synergy.

Use the Area under the curve (AUC) to measure the different approaches. Compare both estimates, then average out the results from both models, and then ensemble them.

actg320 <-read.csv(url("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv?cm_mc_uid=51304980933215218170416&cm_mc_sid_50200000=92178841521817041648&cm_mc_sid_52640000=98592221521817041652"))
dim(actg320)
## [1] 7043   21
plot(sort(actg320$tenure), pch='.', type='o', 
     col='blue', lwd=2 , 
     main = 'Telecom Churn')

A survival model needs two outcome variables: a time variable and an outcome/event variable. Every observation in the data set needs a time period. The event outcome, on the other hand, doesn’t need to be fully known, in contrast with a logistic regression or classification model which requires training on a known outcome. Instead of needing a true/false, sick/healthy, or dead/alive, a survival model uses the concept of the event, something either has happened or we don’t know.

Random Forest Survival

Use a random forest survival model as it offers advantages like capturing non-linear effects that a traditional model cannot do and be easily distributed over multiple cores. The two models that are used are the ranger package and the randomForestSRC package. Focus on the ranger model as it doesn’t require additional steps to get it to work on multiple cores.

# install.packages('ranger')
library(ranger)
# install.packages('survival')
library(survival)


for(i in 1:ncol(actg320)){
  actg320[is.na(actg320[,i]), i] <- mean(actg320[,i], na.rm = TRUE)
}
## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA

## Warning in mean.default(actg320[, i], na.rm = TRUE): argument is not
## numeric or logical: returning NA
str(actg320)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
actg320$Partner<-ifelse(actg320$Partner=='Yes', 1,0)
actg320$Dependents<-ifelse(actg320$Dependents=='Yes', 1,0)
actg320$Churn<-ifelse(actg320$Churn=='Yes', 1,0)

Set up our formula. Survival models require two values in the Surv function, the time period followed by the outcome. After the tilde add predictors as it typically done with most modeling formulas.

survival_formula <- formula(paste('Surv(', 'tenure', ',', 'Churn', ') ~ ','gender+SeniorCitizen+Partner+Dependents+PhoneService+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract+PaperlessBilling+PaymentMethod+MonthlyCharges+TotalCharges'))

survival_model <- ranger(survival_formula,
                         data = actg320,  
                         seed = 1234,
                         importance = 'permutation',
                         mtry = 2,
                         verbose = TRUE,
                         num.trees = 50,
                         write.forest=TRUE)
print("error:"); print(survival_model$prediction.error)
## [1] "error:"
## [1] 0.1255809
# print out coefficients
sort(survival_model$variable.importance)
##           gender    SeniorCitizen     PhoneService PaperlessBilling 
##    -0.0006100947     0.0004172687     0.0017123045     0.0027773576 
##    MultipleLines       Dependents  StreamingMovies      StreamingTV 
##     0.0048602458     0.0050377303     0.0050971282     0.0064323979 
##          Partner    PaymentMethod DeviceProtection     OnlineBackup 
##     0.0085659599     0.0110775089     0.0139850505     0.0164716331 
##   OnlineSecurity  InternetService      TechSupport   MonthlyCharges 
##     0.0191985314     0.0195995215     0.0206194788     0.0245008385 
##         Contract     TotalCharges 
##     0.0756021120     0.0817716005

Once we have survival_model model object, take a look at some probabilities of survival (this is just for illustrative purposes)

plot(survival_model$unique.death.times, survival_model$survival[1,], type='l', col='orange', ylim=c(0.01,1))
lines(survival_model$unique.death.times, survival_model$survival[56,], col='blue')

The plots represent the probability of survival/not reaching event over time. In these cases, the orange line has a much higher provability of not being

plot(survival_model$unique.death.times, survival_model$survival[1,], type='l', col='orange', ylim=c(0.01,1))
for (x in c(2:100)) {
  lines(survival_model$unique.death.times, survival_model$survival[x,], col='red')
}

Area Under The Curve

In order to measure the AUC of each model we need to split the data randomly (with seed) into two equal parts

set.seed(1234)
random_splits <- runif(nrow(actg320))
train_df_official <- actg320[random_splits < .5,]
dim(train_df_official)
## [1] 3498   21
validate_df_official <- actg320[random_splits >= .5,]
dim(validate_df_official)
## [1] 3545   21

Generalized Boosted Regression Model

Focus on the probability of reaching event over the first 60 days

We also need to create a classification-centric outcome variable. This will measure how many customer reached event or not within the chosen period. Here we look for a censor feature of 1 (i.e. the event happened) under the chosen period to set the outcome to 1, everything else is set to 0:

Classification data set

period_choice <- 60 # 103 
train_df_classificaiton  <- train_df_official 
train_df_classificaiton$ReachedEvent <- ifelse((train_df_classificaiton$Churn==1 & 
                                                  train_df_classificaiton$tenure<=period_choice), 1, 0)
summary(train_df_classificaiton$ReachedEvent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2481  0.0000  1.0000
validate_df_classification  <- validate_df_official 
validate_df_classification$ReachedEvent <- ifelse((validate_df_classification$Churn==1 & 
                                                     validate_df_classification$tenure<=period_choice), 1, 0)
summary(validate_df_classification$ReachedEvent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2561  1.0000  1.0000

Now we can easily get an AUC score on the probability of reaching event within our allotted period choice

Classification

Let’s run and score our classification GBM model:

feature_names <- setdiff(names(train_df_classificaiton), c('ReachedEvent', 'tenure', 'Churn', 'customerID'))

# isntall.packages('gbm')
library(gbm)
## Loaded gbm 2.1.4
classification_formula <- formula(paste('ReachedEvent ~ ','gender+SeniorCitizen+Partner+Dependents+PhoneService+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract+PaperlessBilling+PaymentMethod+MonthlyCharges+TotalCharges'))

set.seed(1234)
gbm_model = gbm(classification_formula, 
                data =  train_df_classificaiton,
                distribution='bernoulli',
                n.trees=500,         
                interaction.depth=3,
                shrinkage=0.01,
                bag.fraction=0.5,
                keep.data=FALSE,
                cv.folds=5)

nTrees <- gbm.perf(gbm_model)

validate_predictions <- predict(gbm_model, newdata=validate_df_classification[,feature_names], type="response", n.trees=nTrees)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc1=roc(response=validate_df_classification$ReachedEvent, predictor=validate_predictions)
roc1
## 
## Call:
## roc.default(response = validate_df_classification$ReachedEvent,     predictor = validate_predictions)
## 
## Data: validate_predictions in 2637 controls (validate_df_classification$ReachedEvent 0) < 908 cases (validate_df_classification$ReachedEvent 1).
## Area under the curve: 0.8708
plot(roc1)

Scoring the Random Forest Survival Model

Score RF survival model for the period in question:

survival_model <- ranger(survival_formula,
                           data = train_df_official,
                           seed=1234,
                           verbose = TRUE,
                           num.trees = 50,
                           mtry = 2,
                           write.forest=TRUE )

The survival_model object can offer probabilities on periods it has trained on

Here is the tricky part, in order to get an AUC score out of a survival model, we need to choose our period and reverse the probability - as we’re interested in the probability of reaching event versus the probability of not reaching event.

First we get the basic survival prediction using our validation split set and then we flip the probability of the period of choice and get the AUC score

suvival_predictions <- predict(survival_model, validate_df_official[, c("gender","SeniorCitizen","Partner","Dependents","PhoneService","MultipleLines","InternetService","OnlineSecurity","OnlineBackup","DeviceProtection","TechSupport","StreamingTV","StreamingMovies","Contract","PaperlessBilling","PaymentMethod","MonthlyCharges","TotalCharges")])

roc2=roc(response=validate_df_classification$ReachedEvent, predictor=1 - suvival_predictions$survival[,which(suvival_predictions$unique.death.times==period_choice)])
roc2
## 
## Call:
## roc.default(response = validate_df_classification$ReachedEvent,     predictor = 1 - suvival_predictions$survival[, which(suvival_predictions$unique.death.times ==         period_choice)])
## 
## Data: 1 - suvival_predictions$survival[, which(suvival_predictions$unique.death.times == period_choice)] in 2637 controls (validate_df_classification$ReachedEvent 0) < 908 cases (validate_df_classification$ReachedEvent 1).
## Area under the curve: 0.8482
plot(roc2)

Now that both models can predict the same period and the probability of reaching the event, we average them together and see how they help each other (straight 50/50 here which may not be the best mix)

Blend both together

roc3=roc(predictor = (validate_predictions + (1 - suvival_predictions$survival[,which(suvival_predictions$unique.death.times==period_choice)]))/2, 
    response = validate_df_classification$ReachedEvent)

roc3
## 
## Call:
## roc.default(response = validate_df_classification$ReachedEvent,     predictor = (validate_predictions + (1 - suvival_predictions$survival[,         which(suvival_predictions$unique.death.times == period_choice)]))/2)
## 
## Data: (validate_predictions + (1 - suvival_predictions$survival[, which(suvival_predictions$unique.death.times == period_choice)]))/2 in 2637 controls (validate_df_classification$ReachedEvent 0) < 908 cases (validate_df_classification$ReachedEvent 1).
## Area under the curve: 0.8651

Ensembling Survival and Classification

We need to split our data set into two smaller chunks. This will allow us to get a survival probability on all the data, not just our validation set. So, we split the training set into to parts, and validation into two parts, then use one training chunk to train and predict on the other training chunk and half of the validation chunk. We then train on the other half of our original training set, to predict on the first split chunk of training data and the second half of the testing data!

split training into two datasets

set.seed(1234)
random_splits <- runif(nrow(train_df_official))
train_1 <- train_df_official[random_splits < .5,]
dim(train_1)
## [1] 1755   21
train_2 <- train_df_official[random_splits >= .5,]
dim(train_2)
## [1] 1743   21
# split testing set in two
set.seed(1234)
random_splits <- runif(nrow(validate_df_official))
test_1 <- validate_df_official[random_splits < .5,]
dim(test_1)
## [1] 1774   21
test_2 <- validate_df_official[random_splits >= .5,]
dim(test_2)
## [1] 1771   21
surv_1 <- ranger(survival_formula,
                 data =  train_1,
                 verbose = TRUE,
                 seed=1234,
                 num.trees = 50,
                 mtry = 2,
                 write.forest=TRUE )

preds <- predict( surv_1, rbind(train_2[,feature_names], test_2[,feature_names]))
preds_1 <- data.frame(preds$survival)


surv_2 <- ranger(survival_formula,
                 data = train_2,
                 verbose = TRUE,
                 seed=1234,
                 num.trees = 50,
                 mtry = 2,
                 write.forest=TRUE )

preds <- predict( surv_2, rbind(train_1[,feature_names], test_1[,feature_names]))
preds_2 <- data.frame(preds$survival)

We now rebuild the data set into the orginal training and validation portions with the an added feature: the survival probablity for our period of interest

train_2_ensemble <- cbind(train_2, preds_1[1:nrow(train_2),which(surv_1$unique.death.times == period_choice)])
names(train_2_ensemble)[ncol(train_2_ensemble)] <- 'survival_probablities'
dim(train_2_ensemble)
## [1] 1743   22
names(train_2_ensemble)
##  [1] "customerID"            "gender"               
##  [3] "SeniorCitizen"         "Partner"              
##  [5] "Dependents"            "tenure"               
##  [7] "PhoneService"          "MultipleLines"        
##  [9] "InternetService"       "OnlineSecurity"       
## [11] "OnlineBackup"          "DeviceProtection"     
## [13] "TechSupport"           "StreamingTV"          
## [15] "StreamingMovies"       "Contract"             
## [17] "PaperlessBilling"      "PaymentMethod"        
## [19] "MonthlyCharges"        "TotalCharges"         
## [21] "Churn"                 "survival_probablities"
test_2_ensemble <- cbind(test_2, preds_1[((nrow(train_2_ensemble)+1):nrow(preds_1)),which(surv_1$unique.death.times == period_choice)])
names(test_2_ensemble)[ncol(test_2_ensemble)] <- 'survival_probablities'
train_1_ensemble <- cbind(train_1, preds_2[1:nrow(train_1),which(surv_2$unique.death.times == period_choice)])
names(train_1_ensemble)[ncol(train_1_ensemble)] <- 'survival_probablities'
dim(train_1_ensemble)
## [1] 1755   22
test_1_ensemble <- cbind(test_1, preds_2[((nrow(train_1_ensemble)+1):nrow(preds_2)),which(surv_2$unique.death.times == period_choice)])
names(test_1_ensemble)[ncol(test_1_ensemble)] <- 'survival_probablities' 
dim(test_1_ensemble)
## [1] 1774   22

Finally bring them both back together

train_df_final <- rbind(train_1_ensemble, train_2_ensemble)
validate_df_final <- rbind(test_1_ensemble, test_2_ensemble)

Create classification model outcome again on our new data set:

train_df_final$ReachedEvent <- ifelse((train_df_final$Churn==1 & 
                                           train_df_final$tenure <= period_choice), 1, 0)
summary(train_df_final$ReachedEvent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2481  0.0000  1.0000
validate_df_final$ReachedEvent <- ifelse((validate_df_final$Churn==1 & 
                                            validate_df_final$tenure<= period_choice), 1, 0)

Modeling our augmented data set with our GBM classification model:

feature_names <- setdiff(names(train_df_final), c('ReachedEvent', 'time', 'censor', 'customerID'))

classification_formula <- formula(paste('ReachedEvent ~ ','gender+SeniorCitizen+Partner+Dependents+PhoneService+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract+PaperlessBilling+PaymentMethod+MonthlyCharges+TotalCharges+survival_probablities'))

set.seed(1234)
gbm_model = gbm(classification_formula, 
                data =  train_df_final,
                distribution='bernoulli',
                n.trees=500,         
                interaction.depth=1,
                shrinkage=0.01,
                bag.fraction=0.5,
                keep.data=FALSE,
                cv.folds=5)

nTrees <- gbm.perf(gbm_model)

validate_predictions <- predict(gbm_model, newdata=validate_df_final[,feature_names], type="response", n.trees=nTrees)
roc4=roc(response=validate_df_final$ReachedEvent, predictor=validate_predictions)
roc4
## 
## Call:
## roc.default(response = validate_df_final$ReachedEvent, predictor = validate_predictions)
## 
## Data: validate_predictions in 2637 controls (validate_df_final$ReachedEvent 0) < 908 cases (validate_df_final$ReachedEvent 1).
## Area under the curve: 0.8552
plot(roc4)

plot(roc1, colorize = TRUE)
plot(roc2, add = TRUE, colorize = TRUE, col = 'red')
plot(roc3, add = TRUE, colorize = TRUE, col = 'blue')
plot(roc4, add = TRUE, colorize = TRUE, col = 'green')