Task description and background

The business problem stems from the marketing domain and is related to managing customer churn. Customer churn is a severe problem in many service industries, including telecommunications, banking and insurance, energy, and many more. From a company's perspective, acquiring new customers (e.g., to compensate for departing customers) is substantially more costly than retaining existing ones. Moreover, long-term customers generate higher profits, are less sensitive to competitive actions, and may act as promoters through positive word of mouth. Although relationship-management instruments such as loyalty programs often reduce churn, customer attrition remains a major threat to the financial health of many companies. For example, T-Mobile USA lost half a million of its most lucrative customers in the first quarter of 2012. Contract churn rates for many types of communication services are known to be in the range of three percent per month, which means that a provider needs to refresh nearly 100% of its customer base about every three years.

The purpose of a churn model is to predict the likelihood that a customer will end her/his business relationship with the company in the near future. Equipped with such information, the company can initiate a marketing campaign to proactively prevent customer attrition. In the mobile telecommunication industry, for example, the company could contact customers with a high churn risk and offer them a new phone if they decide to renew their contract. In this assignment, students are given a real-world data set from a large mobile telecommunication company in the US and are asked to develop a prediction model that estimates customers' churn risk as accurately as possible.

Data

The customer data set includes 100,000 observations (customers), each of which is described by 171 attributes. Roughly speaking, the attributes capture demographic, sociographic, and microgeographic characteristics of the customers, information about their interactions with the company, and, in particular, their service-usage behavior (e.g., number of calls, average call length, etc.). A description of the attributes is available in the file BAPM_Assignment_DataDescription.xls.

The data set includes two special attributes: Customer_ID and churn. The former is a unique identifier of the customer. The latter is a binary target (dependent) variable whose two states capture whether a customer churned (churn=1) or not (churn=0) after showing some ‘behavior’, which is represented by the remaining variables. For example, when gathering the data, the company first collected the values of the customer attributes, say in some month \(t\), and then observed whether the customer churned in a subsequent month, e.g., \(t+1\).

Task specification

Let \(x_i\) denote a vector that contains the values of the attributes of customer \(i\). The assignment task is then formally given as estimating the conditional probability of \(y_i = 1\) given \(x_i\); \(P(y_i=1|x_i)\), where the scalar \(y_i\) represents the value of the target variable of customer \(i\); that is, whether customer \(i\) churned (\(y_i=1\)) or not (\(y_i=0\)).

The data set has been randomly split into two parts of equal size, a training set and a test set. The information on whether a customer churned is available in the training set only. The task of the assignment is to build the churn model using the training set. Subsequently, the fully built model is applied to the test set to predict churn probabilities for the customers therein. The results are then to be submitted in a CSV file with the following format:

Customer_ID, EstimatedChurnProbability
1, 0.45
2, 0.32
3, 0.87

Performance Evaluation

Churn models, and scoring models for binary events more generally, can be assessed in a number of ways. The assignment focuses on predictive accuracy. The predicted churn probabilities (of the test set customers!) will be compared to the actual churn outcomes, which are known to the lecturer only. The specific indicator used to summarize this comparison is the lift measure, which is commonly used in direct marketing to assess targeting models. The lift measure is based on a list of customers ordered by their model-estimated scores (from highest to lowest churn risk). For some decile \(d\) of the ordered customer list, the lift measure \(L_d\) is defined as:

\[L_d=\frac{\hat \pi_d}{\hat \pi}\]

where \(\hat \pi\) and \(\hat \pi_d\) denote the fraction of actual churners among all customers and among those ranked in the top fraction \(d\) of the list, respectively. If a retention program targeted customers at random, the fraction of actual churners reached by the campaign would equal \(\hat \pi\) (the fraction expected by chance). The lift measure quantifies how much a churn model improves over such random targeting.
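For illustration (the numbers are made up): if the overall churn rate were \(\hat \pi = 0.05\) and 12% of the customers in the top decile actually churned, i.e., \(\hat \pi_{0.1} = 0.12\), the lift would be

\[L_{0.1}=\frac{\hat \pi_{0.1}}{\hat \pi}=\frac{0.12}{0.05}=2.4,\]

meaning the model reaches 2.4 times as many churners in the top decile as random targeting would.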

An important feature of the lift measure is that, under some assumptions, it captures the profit of a marketing campaign (e.g., the profit of a retention program). Although the lift measure can be defined for any decile \(d\) of the ranked customer list, it is common practice to set \(d\) to some small value. In the assignment, this value is \(0.1\), which means that the accuracy of students' churn predictions will be assessed on the 10% of test set customers with the highest estimated churn probabilities.

Solution

Reading the data

There are some NAs in the data. To deal with them, we first omit columns with a large number of NA values and then drop the remaining rows that contain NAs.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 1.9-8
train<-read.csv("~/R/Churn prediction/BAPM_Trainingset.csv")
colnames(train)<-tolower(colnames(train))
#let's find the variables with lots of NAs
x<-rep(NA,ncol(train))
names(x)<-colnames(train)
for (i in 1:ncol(train)){
  x[i]<-sum(is.na(train[,i]))
}
x[x>1000]
##      crtcount       rmcalls         rmmou         rmrev       avg6mou 
##         48256         42964         42964         42964          1436 
##       avg6qty       avg6rev       ref_qty       tot_ret      tot_acpt 
##          1436          1436         47746         48083         48083 
## pre_hnd_price    hnd_webcap           lor        adults        income 
##         28719          5162         15188         11570         12779 
##      numbcars         educ1       retdays 
##         24755         43278         48083
baddata<-names(x[x>1000])
train<-train[,-which(colnames(train) %in% baddata)]
#get rid of rows with NA in rev_mean
train<-train[!is.na(train$rev_mean),]
#factorize all factor variables
factorVar<-read.csv("~/R/Churn prediction/factorVar.csv",header=FALSE)
factorVar[,1]<-tolower(factorVar[,1])
#don't want to factorize customer_id
factorVar<-factorVar[-which(factorVar=="customer_id"),]
factorCol<-which(colnames(train) %in% factorVar)
for (i in factorCol){
  train[,i]<-as.factor(train[,i])
}
train<-na.omit(train)
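
As a quick check (output not shown), one can verify that no NAs remain after these steps:

sum(is.na(train))  #expected to be 0 after na.omit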

Since we have quite a large number of variables, we may want to get rid of some of them. To do that, we will use regularization. But first, let's check how many levels our factor variables have.

#count the number of factor levels per column
nlevel<-rep(0,ncol(train))
for (i in 1:ncol(train)){
  nlevel[i]<-length(levels(train[,i]))
}
#columns with more than 10 levels
hlevel<-which(nlevel>10)
hlevel
##  [1]  96  99 115 116 119 120 121 122 126 131 132 141 145

The csa variable has a lot of levels. We will replace it with the first three letters of the code.

train$csa<-substr(train$csa,1,3)
train$csa<-factor(train$csa)
levels(train$csa)
##  [1] ""    "AIR" "APC" "ATH" "ATL" "AWI" "BIR" "BOS" "CHI" "DAL" "DEN"
## [12] "DET" "FLN" "GCW" "HAR" "HOU" "HWI" "IND" "INH" "INU" "IPM" "KCY"
## [23] "LAU" "LAW" "LAX" "LOU" "MIA" "MIL" "MIN" "NCR" "NEV" "NMC" "NMX"
## [34] "NNY" "NOL" "NOR" "NSH" "NVU" "NYC" "OHH" "OHI" "OKC" "OMA" "PHI"
## [45] "PHX" "PIT" "SAN" "SDA" "SEA" "SEW" "SFR" "SFU" "SHE" "SLC" "SLU"
## [56] "STL" "VAH"

Now for last_swap, which is actually a date. We convert it and impute missing dates with the mean date:

train$last_swap<-as.Date(train$last_swap,"%m/%d/%Y")  #parse m/d/Y strings into Date
train$last_swap[is.na(train$last_swap)]<-mean(train$last_swap,na.rm=TRUE)  #mean-date imputation

Evaluation function

Now we create a function that evaluates our models using the top-decile lift described above.

evaluation<-function(pred,train){
  #pred: data frame with customer_id in column 1 and estimated churn probability in column 2
  top10<-pred[pred[,2]>=quantile(pred[,2],probs=.9),]  #top decile by estimated score
  top10real<-train[train$customer_id %in% top10[,1],]  #actual outcomes of these customers
  new<-sum(top10real$churn==1)/nrow(top10real)  #churn rate in the top decile
  total<-sum(train[train$customer_id %in% pred[,1],]$churn==1)/nrow(pred)  #overall churn rate
  new/total  #the lift
}
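
As a quick sanity check (a sketch; output not shown), uniformly random scores should produce a lift close to 1, since the top decile of a random ranking contains roughly the average share of churners:

set.seed(1)
rand.pred<-data.frame(train$customer_id,runif(nrow(train)))
evaluation(rand.pred,train)  #expect a value close to 1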

Now, on to the regularization.

Lasso regularization

library(glmnet)
glmtrain=model.matrix(churn~.,train)  #design matrix with dummy-coded factors
y=train$churn
#cross-validated lasso (alpha=1) for the binomial model
cv.out=cv.glmnet(glmtrain,y,alpha=1,family="binomial",type.measure="class")
plot(cv.out)

#in-sample predictions at lambda.1se, the most regularized model within one standard error of the minimum
lasso.pred=predict(cv.out,s=cv.out$lambda.1se,newx=glmtrain,type="response")
pred<-data.frame(train$customer_id,lasso.pred)
evaluation(pred,train)
## [1] 1.384368
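
To inspect which variables the lasso retained at lambda.1se (a sketch; the exact selection depends on the random folds, and output is not shown):

lasso.coef<-as.matrix(coef(cv.out,s=cv.out$lambda.1se))
#variables with non-zero coefficients, i.e., those kept by the lasso
rownames(lasso.coef)[lasso.coef!=0]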

Proper Cross-validation

Since we have a very specific metric for evaluating the results, we can implement our own cross-validation and measure performance with our evaluation function.

glmtrain=model.matrix(churn~.,train)
y=train$churn
k=10
lambda=cv.out$lambda  #reuse the lambda sequence from cv.glmnet
m=length(lambda)
folds=sample(1:k,nrow(train),replace=TRUE)
#despite the name, cv.errors stores lift values, so higher is better
cv.errors=matrix(NA,k,m,dimnames=list(NULL,paste(1:m)))

for (j in 1:k){
  best.fit=glmnet(glmtrain[folds!=j,],y[folds!=j],
                  alpha=1,family="binomial",lambda=lambda)
  cv.errors[j,]=sapply(1:m,function(x){
    pred0=predict(best.fit,s=best.fit$lambda[x],
                  newx=glmtrain[folds==j,],type="response")
    pred<-data.frame(train[folds==j,]$customer_id,pred0)
    evaluation(pred,train)
  })
}
par(mfrow=c(2,1))
plot(apply(cv.errors,2,mean))
plot(cv.out$nzero)

Now we take the best-performing lambda and predict for the whole training set:

wm=which.max(apply(cv.errors,2,mean))
lasso.pred=predict(cv.out,s=lambda[wm],newx=glmtrain,type="response")
pred<-data.frame(train$customer_id,lasso.pred)
evaluation(pred,train)
## [1] 1.430876

Our graphs show that a good choice would be the lambda at index 33 of the sequence, as cross-validation gives good results there while the number of non-zero coefficients stays as small as possible.

lasso.pred=predict(cv.out,s=lambda[33],newx=glmtrain,type="response")
pred<-data.frame(train$customer_id,lasso.pred)
evaluation(pred,train)
## [1] 1.391071
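
Finally, a sketch of how the chosen model could be applied to the test set to produce the submission file. This assumes that the test set (the file name is a guess) has been read and preprocessed with exactly the same steps as the training set, so that the columns of its design matrix match those used for fitting:

test<-read.csv("~/R/Churn prediction/BAPM_Testset.csv")
#...apply the same preprocessing steps as for the training set...
glmtest<-model.matrix(~.,test)  #no churn column in the test set
test.pred<-predict(cv.out,s=lambda[33],newx=glmtest,type="response")
submission<-data.frame(Customer_ID=test$customer_id,
                       EstimatedChurnProbability=as.numeric(test.pred))
#write the file in the format given under 'Task specification'
write.table(submission,"submission.csv",sep=",",row.names=FALSE,quote=FALSE)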