Intro to CRISP-DM: Business Case

“Data-Driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely experience or intuition”. Provost & Fawcett

Logistic Regression in Business

[Figure: Business CRISP-DM]

1. Data Preparation - Getting the right data

One of the most powerful features of R and Python is the ability to read and combine data from many sources: Excel, CSV, SAS, SPSS, SQL databases, JSON, web APIs, etc.

Identify the available raw material from which the solution will be built.
Assess the cost, strengths and limitations of the data.
It is very important to optimize this step when the dataset is large or constantly updated.
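As a minimal sketch of combining sources, the snippet below reads two small tables with different delimiters and joins them on a shared key. The inline data, the `;` separator and the object names are hypothetical and only stand in for real files or database extracts.

```r
# Hypothetical inline data standing in for two separate sources
customers_text <- "customerID,tenure\nA1,12\nA2,3\nA3,40"
billing_text   <- "customerID;MonthlyCharges\nA1;29.85\nA2;56.95\nA4;70.70"

# read.csv() accepts a `text` argument, so no files are needed for the sketch
customers <- read.csv(text = customers_text, stringsAsFactors = FALSE)
billing   <- read.csv(text = billing_text, sep = ";", stringsAsFactors = FALSE)

# Left join: keep every customer, fill unmatched billing fields with NA
combined <- merge(customers, billing, by = "customerID", all.x = TRUE)
combined
```

Customer A3 has no billing record, so its `MonthlyCharges` is `NA` after the join; this is one common way missing values enter a dataset.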

library(tidyverse)
library(caret)
library(ggplot2)
library(e1071)


path <- "https://raw.githubusercontent.com/dataoptimal/posts/master/business%20impact%20project/Telco%20Data.csv"

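# stringsAsFactors = TRUE reproduces the factor columns shown below
# (since R 4.0, read.csv() keeps strings as characters by default)
dataset <- read.csv(path, stringsAsFactors = TRUE)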

2. First overview: Exploratory Data Analysis

What is Exploratory Data Analysis (EDA)?

How to ensure you are ready to use machine learning algorithms in a project? How to choose the most suitable algorithms for your data set? How to define the feature variables that can potentially be used for machine learning?

Exploratory Data Analysis (EDA) helps to answer all these questions, ensuring the best outcomes for the project. It is an approach for summarizing, visualizing, and becoming intimately familiar with the important characteristics of a data set.

Methods of Exploratory Data Analysis

It is always better to explore each data set using multiple exploratory techniques and compare the results. Once the data set is fully understood, it is quite possible that the data scientist will have to go back to the data collection and cleansing phases in order to transform the data set according to the desired business outcomes. The goal of this step is to become confident that the data set is ready to be used in a machine learning algorithm.

Exploratory Data Analysis is mainly performed using the following methods:

Univariate visualization: provides summary statistics for each field in the raw data set
Bivariate visualization: is performed to find the relationship between each variable in the dataset and the target variable of interest
Multivariate visualization: is performed to understand interactions between different fields in the dataset
Dimensionality reduction: helps to understand the fields in the data that account for the most variance between observations and allow for the processing of a reduced volume of data.
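The four methods above can be sketched with base R on a small synthetic data set. Everything here is an assumption made for illustration: the column names, the simulated relationship between tenure and churn, and the choice of PCA as the dimensionality-reduction technique.

```r
set.seed(1)
n <- 200
toy <- data.frame(
  tenure  = round(runif(n, 0, 72)),
  monthly = round(runif(n, 20, 120), 2)
)
toy$total <- toy$tenure * toy$monthly + rnorm(n, sd = 50)
# Assumed relationship: churn probability falls as tenure grows
toy$churn <- factor(ifelse(runif(n) < plogis(1 - 0.06 * toy$tenure), "Yes", "No"),
                    levels = c("No", "Yes"))

# Univariate: summary statistics of one field at a time
summary(toy$tenure)
prop.table(table(toy$churn))

# Bivariate: one feature against the target
aggregate(tenure ~ churn, data = toy, FUN = median)

# Multivariate: interactions between numeric fields
round(cor(toy[, c("tenure", "monthly", "total")]), 2)

# Dimensionality reduction: share of variance captured by each component
pca <- prcomp(toy[, c("tenure", "monthly", "total")], scale. = TRUE)
var_share <- summary(pca)$importance["Proportion of Variance", ]
var_share
```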

Through these methods, the data scientist validates assumptions, identifies patterns that help in understanding the problem and selecting a model, and checks that the data was generated as expected. In practice, the distribution of values in each field is checked, the number of missing values is counted, and possible ways of replacing them are identified.

Key Concepts of Exploratory Data Analysis

2 types of Data Analysis
- Confirmatory Data Analysis
- Exploratory Data Analysis
4 Objectives of EDA
- Discover Patterns
- Spot Anomalies
- Frame Hypotheses
- Check Assumptions
What is examined during EDA
- Trends
- Distribution
- Mean
- Median
- Outlier
- Spread measurement (SD)
- Correlations
- Hypothesis testing
- Visual Exploration
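A short sketch of a few of these quantities on a hypothetical vector of monthly charges follows. It shows why the mean and median are worth comparing, and uses the common 1.5 × IQR rule to flag outliers. The values and the choice of rule are assumptions for illustration.

```r
# Hypothetical charges with one extreme value
x <- c(18, 20, 25, 29, 35, 56, 70, 80, 90, 400)

mean(x)    # 82.3, pulled upwards by the extreme value
median(x)  # 45.5, robust to it
sd(x)      # spread

# 1.5 * IQR rule: flag points far outside the middle 50% of the data
q   <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
outliers   # 400
```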

glimpse(dataset)

## Rows: 7,043
## Columns: 21
## $ customerID       <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237…
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Fema…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Y…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, N…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, …
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, …
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes,…
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber …
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No…
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No i…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No i…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No int…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No in…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No int…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, …
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, N…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank t…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 194…
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, …

So, we have 7,043 observations of 21 variables, with a mix of discrete and continuous variables for each customer. Our dependent variable is "Churn", which is a binary (Yes/No) variable.

summary(dataset)

##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2810   No                 :2785   Month-to-month:3875  
##  No internet service:1526   No internet service:1526   One year      :1473  
##  Yes                :2707   Yes                :2732   Two year      :1695  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling                   PaymentMethod  MonthlyCharges  
##  No :2872         Bank transfer (automatic):1544   Min.   : 18.25  
##  Yes:4171         Credit card (automatic)  :1522   1st Qu.: 35.50  
##                   Electronic check         :2365   Median : 70.35  
##                   Mailed check             :1612   Mean   : 64.76  
##                                                    3rd Qu.: 89.85  
##                                                    Max.   :118.75  
##                                                                    
##   TotalCharges    Churn     
##  Min.   :  18.8   No :5174  
##  1st Qu.: 401.4   Yes:1869  
##  Median :1397.5             
##  Mean   :2283.3             
##  3rd Qu.:3794.7             
##  Max.   :8684.8             
##  NA's   :11

options(repr.plot.width =6, repr.plot.height = 2)
ggplot(dataset, aes(y= tenure, x = "", fill = Churn)) + 
geom_boxplot()+ 
theme_test()+
xlab(" ")

ggplot(dataset, aes(x=StreamingMovies,fill=Churn))+ 
          geom_bar(position = 'fill')+theme_bw()+
          scale_x_discrete(labels = function(x) str_wrap(x, width = 10))

ggplot(dataset, aes(x=Contract,fill=Churn))+ 
          geom_bar(position = 'fill')+theme_bw()+
          scale_x_discrete(labels = function(x) str_wrap(x, width = 10))

ggplot(dataset, aes(x=PaperlessBilling,fill=Churn))+ 
          geom_bar(position = 'fill')+theme_bw()+
          scale_x_discrete(labels = function(x) str_wrap(x, width = 10))

ggplot(dataset, aes(x=PaymentMethod,fill=Churn))+
          geom_bar(position = 'fill')+theme_bw()+
          scale_x_discrete(labels = function(x) str_wrap(x, width = 10))


Most of our variables are already factors rather than character vectors, which is better for modelling. One variable, "SeniorCitizen", is an integer with only two values, so we convert it to a factor.

dataset$SeniorCitizen <- as.factor(dataset$SeniorCitizen)

Missing Values

In a project of any size, the data is likely to be incomplete, improperly coded or mislabeled. Missing values are represented by the symbol NA (not available) in R.

# is.na(x) returns a logical vector (TRUE where the observation is missing);
# summing it for each column counts the missing values
sapply(dataset, function(x) sum(is.na(x)))

##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

table(dataset$Churn, is.na(dataset$TotalCharges))

##      
##       FALSE TRUE
##   No   5163   11
##   Yes  1869    0

Dealing with missing values:

1. Delete the observations with missing values (“listwise deletion”). Recommended when the number of missing values is particularly small. Check that enough data points remain and that the deletion does not introduce a bias.
2. Delete the variable: when a variable has a large number of missing values and carries little meaning for the project, it can be removed.
3. Fill in the missing data with reasonable or most likely values:

3.1 Median or mean imputation

3.2 Model-based imputation, for example kNN imputation
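Option 3.1 can be sketched in base R as follows. The values are hypothetical, and keeping a `was_missing` flag is an optional design choice (an assumption here), so that a model can later tell imputed values from observed ones.

```r
# Hypothetical charges with two missing entries
charges <- c(29.85, 1889.50, NA, 1840.75, 151.65, NA, 820.50)

# Remember which entries were missing before filling them in
was_missing <- is.na(charges)

# Median imputation: less sensitive to skewed values than the mean
fill_value <- median(charges, na.rm = TRUE)
imputed <- ifelse(was_missing, fill_value, charges)
imputed
```

Model-based imputation (3.2) follows the same pattern, but replaces `fill_value` with a prediction from the other columns.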

# We continue with the first option: we drop the rows with missing values

dataset <- dataset[complete.cases(dataset), ]

# Alternative: impute with the mean
# dataset <- dataset %>% 
#       mutate(TotalCharges = replace(TotalCharges,
#                                is.na(TotalCharges), mean(TotalCharges, na.rm = TRUE)))

Two things to prepare for the final model

1. Remove columns unnecessary for the model: customerID, TotalCharges

dataset <- dataset %>% select(-customerID, -TotalCharges)

2. Transform and analyze each variable in more detail if multiple transformations or dimension reduction have been performed.

3. Split the Dataset

For Cross-sectional data:
- 1. Train-Test Split
- 2. Cross-Validation (e.g. K-Fold Cross Validation)
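K-fold cross-validation (option 2) can be sketched in base R as below. The data is synthetic and the column names, coefficients and `k = 5` are assumptions. The point is only the mechanics: each observation is held out exactly once, and the accuracy is averaged over the folds.

```r
set.seed(10)
n <- 500
sim <- data.frame(tenure = runif(n, 0, 72), monthly = runif(n, 20, 120))
# Assumed data-generating process for the illustration
p_true <- plogis(-0.5 - 0.05 * sim$tenure + 0.02 * sim$monthly)
sim$churn <- factor(ifelse(runif(n) < p_true, "Yes", "No"), levels = c("No", "Yes"))

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold label for each row
acc <- numeric(k)

for (j in 1:k) {
  train_j <- sim[fold != j, ]  # fit on k-1 folds
  test_j  <- sim[fold == j, ]  # evaluate on the held-out fold
  fit_j   <- glm(churn ~ tenure + monthly, data = train_j, family = binomial)
  p_j     <- predict(fit_j, newdata = test_j, type = "response")
  pred_j  <- ifelse(p_j > 0.5, "Yes", "No")
  acc[j]  <- mean(pred_j == test_j$churn)
}

mean(acc)  # cross-validated accuracy
```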

library(caTools)
set.seed(10) # makes the random split reproducible
# Alternative with caret:
# split <- createDataPartition(dataset$Churn, p=0.75, list=FALSE)
split <- sample.split(dataset$Churn, SplitRatio = 0.8) # keeps the Yes/No ratio in both sets
train <- dataset[split==TRUE,]
train_churn <- dataset[split==TRUE, "Churn"]

test <- dataset[split==FALSE,]
test_churn <- dataset[split==FALSE, "Churn"] # test labels come from the rows where split is FALSE

4. Model the data

Always bear in mind the problem that you need to solve, and start with a baseline model.
In our business case, it is easier to lose a client than to acquire a new one. Hence, it is important to evaluate our models with this in mind.
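One simple baseline, sketched here as an assumption about what "baseline" could mean, is the majority-class rule: always predict "No". The counts are those reported by `summary(dataset)` above, before the 11 incomplete rows were removed.

```r
# Class counts from the summary() output (full data set)
churn_counts <- c(No = 5174, Yes = 1869)

# Accuracy of always predicting the most frequent class
baseline_acc <- max(churn_counts) / sum(churn_counts)
round(baseline_acc, 4)  # 0.7346
```

Any model should beat this accuracy to be worth using. Note also that this baseline never identifies a single churner, which is why accuracy alone is not enough in this business case.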

1. Logistic Regression.

[Figure: logistic regression model]

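fit1 <- glm(formula = Churn~., data=train, family=binomial)

# Alternative: a smaller model with a few predictors
# fit1 <- glm(formula = Churn~ gender + Partner + tenure, data=train, family=binomial)

# Predicted probability of churn ("Yes") for each test customer
pred_logistic <- predict(fit1, test, type="response")
y_pred <- ifelse(pred_logistic > 0.5, "Yes", "No")

# Note: caret reads the columns of a table as the true classes. Here the columns
# hold y_pred, so in the output below "Sensitivity"/"Specificity" are really the
# predictive values, and "Pos/Neg Pred Value" are the true sensitivity/specificity.
confusionMatrix(table(test$Churn, y_pred), positive = "Yes")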

## Confusion Matrix and Statistics
## 
##      y_pred
##        No Yes
##   No  938  95
##   Yes 160 214
##                                           
##                Accuracy : 0.8188          
##                  95% CI : (0.7976, 0.8386)
##     No Information Rate : 0.7804          
##     P-Value [Acc > NIR] : 0.0002182       
##                                           
##                   Kappa : 0.5084          
##                                           
##  Mcnemar's Test P-Value : 6.128e-05       
##                                           
##             Sensitivity : 0.6926          
##             Specificity : 0.8543          
##          Pos Pred Value : 0.5722          
##          Neg Pred Value : 0.9080          
##              Prevalence : 0.2196          
##          Detection Rate : 0.1521          
##    Detection Prevalence : 0.2658          
##       Balanced Accuracy : 0.7734          
##                                           
##        'Positive' Class : Yes             
##
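The key rates can also be recomputed by hand from the four counts in the table above. This is a useful check because caret's documentation states that, for table input, the true classes should be in the columns, whereas the table above has the actual classes in its rows. The sketch below makes the orientation explicit; the object names are assumptions.

```r
# Counts copied from the confusion matrix above: rows = actual, columns = predicted
cm <- matrix(c(938, 160, 95, 214), nrow = 2,
             dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

TP <- cm["Yes", "Yes"]  # churners correctly flagged
FN <- cm["Yes", "No"]   # churners missed
FP <- cm["No", "Yes"]   # loyal customers flagged by mistake
TN <- cm["No", "No"]    # loyal customers correctly left alone

accuracy    <- (TP + TN) / sum(cm)  # 0.8188
sensitivity <- TP / (TP + FN)       # share of actual churners detected: 0.5722
specificity <- TN / (TN + FP)       # share of loyal customers left alone: 0.9080
precision   <- TP / (TP + FP)       # share of flagged customers who churn: 0.6926

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

With the actual classes as the reference, the model detects about 57% of churners at the 0.5 threshold. This is the number that matters most when a missed churner is the expensive mistake.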

The predict function returns probabilities, so we need a threshold to turn them into class labels. A common choice, used above, is prob = 0.5. But we can instead choose the threshold that minimizes the cost for a hypothetical company.

5. Minimize cost

Suppose that one month of our service costs 100 euros. The worst-case scenario is that we lose a customer that we believed would not churn. Let's say that losing this customer costs us ten months' worth of payments (1000 euros) due to strong competition in the industry. In addition, we assume that we spend two months of the customer fee (200 euros) per year just on retaining a client.

FN (we predict that a customer won’t churn, but they do): €1000
TP (we predict that a customer will churn, and they would have): €200
FP (we predict that a customer will churn, but they wouldn’t have): €200
TN (we predict that a customer won’t churn, and they don’t): €0
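As a worked example, the unit costs above can be applied to the counts of the 0.5-threshold confusion matrix shown earlier (938 TN, 95 FP, 160 FN, 214 TP). The function name `policy_cost` and the two reference policies are assumptions added for comparison.

```r
# Unit costs in euros, taken from the list above
cost_fn <- 1000  # missed churner
cost_tp <- 200   # retention spend on a real churner
cost_fp <- 200   # retention spend on a customer who would have stayed
cost_tn <- 0

policy_cost <- function(tp, fp, fn, tn) {
  fn * cost_fn + tp * cost_tp + fp * cost_fp + tn * cost_tn
}

# Model at threshold 0.5
at_half <- policy_cost(tp = 214, fp = 95, fn = 160, tn = 938)        # 221800

# Reference policies on the same 1407 test customers (374 churners)
contact_nobody   <- policy_cost(tp = 0, fp = 0, fn = 374, tn = 1033) # 374000
contact_everyone <- policy_cost(tp = 374, fp = 1033, fn = 0, tn = 0) # 281400

c(at_half = at_half, contact_nobody = contact_nobody,
  contact_everyone = contact_everyone)
```

Under these assumed costs the model at 0.5 is already cheaper than either extreme policy. The remaining question is whether a different threshold lowers the cost further.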

[Figure: types of error]

What we will do is try different thresholds, each of which changes the confusion matrix, and then choose the one that minimizes the cost for our company.

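th <- seq(0.1, 1.0, length.out = 10)
total_cost <- rep(0, length(th))
for (i in seq_along(th)){
      
      pred <- rep("No", length(pred_logistic))
      pred[pred_logistic > th[i]] <- "Yes"
      # fix both levels so the table stays 2x2 even when no "Yes" is predicted
      pred <- factor(pred, levels = c("No", "Yes"))
      conf <- confusionMatrix(pred, test$Churn, positive = "Yes")
      # conf$table: predictions in rows, reference in columns (read column by column)
      TN <- conf$table[1]
      FP <- conf$table[2]
      FN <- conf$table[3]
      TP <- conf$table[4]
      # total cost in thousands
      total_cost[i] <- (FN*1000 + TP*200 + FP*200 + TN*0)/1000
}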

library(ggplot2)
library(plotly)

dt <- data.frame(th, total_cost)
my_chart <- ggplot(dt, aes(x = th, y = total_cost)) +
  geom_line() +
  geom_point() +
  theme()
ggplotly(my_chart)
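Instead of reading the minimum off the chart, the cheapest threshold can be picked programmatically. The sketch below is self-contained and runs on simulated probabilities, so the data, the helper `cost_at` and the finer grid are all assumptions. With the fitted model, `prob` and `actual` would be `pred_logistic` and `test$Churn`.

```r
set.seed(10)
n <- 1000
actual <- factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.73, 0.27)),
                 levels = c("No", "Yes"))
# Hypothetical scores: churners tend to receive higher probabilities
prob <- ifelse(actual == "Yes", rbeta(n, 3, 2), rbeta(n, 2, 4))

# Cost (in thousands) of acting on every customer whose probability exceeds th
cost_at <- function(th, prob, actual, c_fn = 1000, c_contact = 200) {
  flagged   <- prob > th
  fn        <- sum(!flagged & actual == "Yes")  # missed churners
  contacted <- sum(flagged)                     # TP + FP, both cost a retention spend
  (fn * c_fn + contacted * c_contact) / 1000
}

grid  <- seq(0.05, 0.95, by = 0.05)
costs <- vapply(grid, cost_at, numeric(1), prob = prob, actual = actual)

best_th <- grid[which.min(costs)]
best_th
```

As a rough guide, if the predicted probabilities were well calibrated, contacting a customer (200) would be cheaper than not contacting them (p × 1000) whenever p > 0.2. So a cost-minimizing threshold well below 0.5 is to be expected under these costs.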

Next Steps:
- 1. Deploy the model
- 2. Evaluation and reporting
- 3. Documentation and OOP
- 4. Feature importance/Feature Selection analysis.

References

A short post on logistic regression can be found at: SaedSayad

For more technical knowledge:
An Introduction to Statistical Learning, with Applications in R (book website)

Provost & Fawcett, Data Science for Business