Class 3 : Business Case

Logistic Regression in Business

1. Get data and first overview

One of the powerful features of R is the possibility to use data from Excel, CSV, SAS, SPSS, SQL, JSON, API, etc.

Two of the most important ones:

Reading from csv files:
Reading directly from “internet”

library(dplyr)
library(caret)
setwd("/Users/martinliendo/Documents/Proyectos/elective_meba")

path <- "https://raw.githubusercontent.com/dataoptimal/posts/master/business%20impact%20project/Telco%20Data.csv"

dataset <- read.csv(path)

2. First overview: Data Exploratory

glimpse(dataset)

## Observations: 7,043
## Variables: 21
## $ customerID       <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 92...
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Fe...
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No,...
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No,...
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49...
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes...
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Ye...
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fibe...
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, ...
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No...
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No...
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No i...
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No ...
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No i...
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year...
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes,...
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 2...
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1...
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No...

So, we have 7043 observation with 21 variables with a mix of discrete and continous variables for each customer. Our dependent variable is "Churn" which is a binary (YES/NO) variable

summary(dataset)

##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2810   No                 :2785   Month-to-month:3875  
##  No internet service:1526   No internet service:1526   One year      :1473  
##  Yes                :2707   Yes                :2732   Two year      :1695  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling                   PaymentMethod  MonthlyCharges  
##  No :2872         Bank transfer (automatic):1544   Min.   : 18.25  
##  Yes:4171         Credit card (automatic)  :1522   1st Qu.: 35.50  
##                   Electronic check         :2365   Median : 70.35  
##                   Mailed check             :1612   Mean   : 64.76  
##                                                    3rd Qu.: 89.85  
##                                                    Max.   :118.75  
##                                                                    
##   TotalCharges    Churn     
##  Min.   :  18.8   No :5174  
##  1st Qu.: 401.4   Yes:1869  
##  Median :1397.5             
##  Mean   :2283.3             
##  3rd Qu.:3794.7             
##  Max.   :8684.8             
##  NA's   :11

Most of our variables are already a factor and not a character class. This is better for models. But one variable "Senior Citizen" is an integer with only two values. We transform this in a factor variable

dataset$SeniorCitizen <- as.factor(dataset$SeniorCitizen)

Missing Values

In a project of any size, data is likely to be incomplete , improperly coded/labeled data, etc. These values are represented by the symbol NA (not available) in R.

To detect them is na(X) returns a boolean (TRUE if the observation is missing)

sapply(dataset, function(x) sum(is.na(x)))

##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

table(dataset$Churn, is.na(dataset$TotalCharges))

##      
##       FALSE TRUE
##   No   5163   11
##   Yes  1869    0

Dealing with missing values:

Deleting missing observations (“listwise deletion”). Recommended when the number of missing data is particularly small. Check that there are still sufficient data points and not to introduce a bias by deleting.
Deleting the variable: when a variable has a large number of missing value and it doesn’t have a sound meaning for the project, then it can be removed.
Fill missing data with reasonable possibles values:

3.1 Median or mean imputation

3.2 Model Imputation, for example Knn Imputation

# We continue with the first way , we eliminate the rows with missing values 

dataset <- dataset[complete.cases(dataset), ]

#To impute with the mean 
# dataset <- dataset %>% 
#       mutate(TotalCharges = replace(TotalCharges,
#                                is.na(TotalCharges),mean(TotalCharges, na.rm = T)))

Two things to prepare for the final model

1. Remove columns unnecessary for the model : CustomerID

dataset <- dataset %>% select(-customerID, -TotalCharges)

1. Transform and analyze each variable in more detail (homework).

3. Split the Dataset

library(caTools)
set.seed(10) # to ensure the reproducibility in the random numbers
# split <- createDataPartition(dataset$Churn, p=0.75, list=FALSE)
split = sample.split(dataset$Churn , SplitRatio = 0.8)
train <- dataset[split==TRUE,]
train_churn <- dataset[split==TRUE, "Churn"]

test <- dataset[split ==FALSE,]
test_churn <- dataset[split==TRUE, "Churn"]

4. Model the data

Always bear in mind the problem that you need to solve. In today’s markets it is very common that it is easier to lose a client rather than to acquire a new client. Hence, it is important to analyze our existing clients.

What is the problem that we are trying to solve here?.

1. Logistic Regression.

lm_model-Logistic-

fit1 <- glm(formula = Churn~., data=train, family=binomial)

pred_logistic <- predict(fit1, test, type="response")
y_pred <- ifelse(pred_logistic > 0.5, "Yes", "No")

confusionMatrix(test$Churn, y_pred, positive = "Yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  938  95
##        Yes 160 214
##                                           
##                Accuracy : 0.8188          
##                  95% CI : (0.7976, 0.8386)
##     No Information Rate : 0.7804          
##     P-Value [Acc > NIR] : 0.0002182       
##                                           
##                   Kappa : 0.5084          
##  Mcnemar's Test P-Value : 6.128e-05       
##                                           
##             Sensitivity : 0.6926          
##             Specificity : 0.8543          
##          Pos Pred Value : 0.5722          
##          Neg Pred Value : 0.9080          
##              Prevalence : 0.2196          
##          Detection Rate : 0.1521          
##    Detection Prevalence : 0.2658          
##       Balanced Accuracy : 0.7734          
##                                           
##        'Positive' Class : Yes             
##

The output of the predict function returns probabilities , now we need to set the threshold. One (common) possibility is to set the threshold in prob = 0.5. But we can directly try to set the threshold subject to the minimization of cost for a hypothetical company

5. Minimize cost

Suppose that one month of our service is 100 euros. Then, the worst case scenario is that we lose a customer that we believe would not churn. Lets say that this customer cost us ten month worth of his payment due to hard competition in the industry. In addition, we would assume that we spend two month (200) of the customer fee per year only in retain the client.

FN (predict that a customer won’t churn, but they actually do): $1000
TP (predict that a customer would churn, when they actually would): $200
FP (predict that a customer would churn, when they actually wouldn’t): $200
TN (predict that a customer won’t churn, when they actually wouldn’t): $0

type_error

What we will try to do is to set different threshold that will change the confusion matrix and then choose the one that minimizes the cost for our company.

th <- seq(0.1,1.0, length = 10)
total_cost = rep(0,length(th))
for (i in 1:length(th)){
      
      pred = rep("No", length(pred_logistic))
      pred[pred_logistic > th[i]] = "Yes"
      pred <- as.factor(pred)
      conf <- confusionMatrix(pred, test$Churn, positive = "Yes")
      TN <- conf$table[1]
      FP <- conf$table[2]
      FN <- conf$table[3]
      TP <- conf$table[4]
      total_cost[i] = (FN*1000 + TP*200 + FP*200 + TN*0)/1000
}

library(ggplot2)
library(plotly)

dt <- data.frame(th, total_cost)
my_chart <- ggplot(dt, aes(x = th, y = total_cost)) +
  geom_line() +
  geom_point() +
  theme()
ggplotly(my_chart)

Let’s discuss the findings!!

References

The core of this class is base in this post:
TowardsDataScience

The logistic short post can be found in :
SaedSayad

For more technical knowledge:
Introduction to Statistical Learning in R website