-Business CRISP-DM-
One of the powerful features of R/Python is the possibility to use and combine data from Excel, CSV, SAS, SPSS, SQL, JSON, API, etc.
What is Exploratory Data Analysis (EDA)?
How to ensure you are ready to use machine learning algorithms in a project? How to choose the most suitable algorithms for your data set? How to define the feature variables that can potentially be used for machine learning?
Exploratory Data Analysis (EDA) helps to answer all these questions, ensuring the best outcomes for the project. It is an approach for summarizing, visualizing, and becoming intimately familiar with the important characteristics of a data set.
It is always better to explore each data set using multiple exploratory techniques and compare the results. Once the data set is fully understood, it is quite possible that data scientist will have to go back to data collection and cleansing phases in order to transform the data set according to the desired business outcomes. The goal of this step is to become confident that the data set is ready to be used in a machine learning algorithm.
Through these methods, the data scientist validates assumptions and identifies patterns that will allow for the understanding of the problem and model selection and validates that the data has been generated in the way it was expected to. So, value distribution of each field is checked, a number of missing values is defined, and the possible ways of replacing them are found. Key Concepts of Exploratory Data Analysis
2 types of Data Analysis
Confirmatory Data Analysis
Exploratory Data Analysis
4 Objectives of EDA
Discover Patterns
Spot Anomalies
Frame Hypothesis
Check Assumptions
Stuff done during EDA
Trends
Distribution
Mean
Median
Outlier
Spread measurement (SD)
Correlations
Hypothesis testing
Visual Exploration
## Rows: 7,043
## Columns: 21
## $ customerID <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237…
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Fema…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Y…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, N…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, …
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, …
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes,…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber …
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No i…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No i…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No int…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No in…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No int…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, …
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, N…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank t…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 194…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, …
So, we have 7043 observation with 21 variables with a mix of discrete and continous variables for each customer. Our dependent variable is "Churn" which is a binary (YES/NO) variable
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies Contract
## No :2810 No :2785 Month-to-month:3875
## No internet service:1526 No internet service:1526 One year :1473
## Yes :2707 Yes :2732 Two year :1695
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges
## No :2872 Bank transfer (automatic):1544 Min. : 18.25
## Yes:4171 Credit card (automatic) :1522 1st Qu.: 35.50
## Electronic check :2365 Median : 70.35
## Mailed check :1612 Mean : 64.76
## 3rd Qu.: 89.85
## Max. :118.75
##
## TotalCharges Churn
## Min. : 18.8 No :5174
## 1st Qu.: 401.4 Yes:1869
## Median :1397.5
## Mean :2283.3
## 3rd Qu.:3794.7
## Max. :8684.8
## NA's :11
options(repr.plot.width =6, repr.plot.height = 2)
ggplot(dataset, aes(y= tenure, x = "", fill = Churn)) +
geom_boxplot()+
theme_test()+
xlab(" ")ggplot(dataset, aes(x=StreamingMovies,fill=Churn))+
geom_bar(position = 'fill')+theme_bw()+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))ggplot(dataset, aes(x=Contract,fill=Churn))+
geom_bar(position = 'fill')+theme_bw()+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))ggplot(dataset, aes(x=PaperlessBilling,fill=Churn))+
geom_bar(position = 'fill')+theme_bw()+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))ggplot(dataset, aes(x=PaymentMethod,fill=Churn))+
geom_bar(position = 'fill')+theme_bw()+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))Most of our variables are already a factor and not a character class. This is better for models. But one variable "Senior Citizen" is an integer with only two values. We transform this in a factor variable
In a project of any size, data is likely to be incomplete , improperly coded/labeled data, etc. These values are represented by the symbol NA (not available) in R.
# To detect them is na(X) returns a boolean (TRUE if the observation is missing)
sapply(dataset, function(x) sum(is.na(x)))## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
##
## FALSE TRUE
## No 5163 11
## Yes 1869 0
Dealing with missing values:
Deleting missing observations (“listwise deletion”). Recommended when the number of missing data is particularly small. Check that there are still sufficient data points and not to introduce a bias by deleting.
Deleting the variable: when a variable has a large number of missing value and it doesn’t have a sound meaning for the project, then it can be removed.
Fill missing data with reasonable/most likelihood values:
3.1 Median or mean imputation
3.2 Model Imputation, for example Knn Imputation
# We continue with the first way , we eliminate the rows with missing values
dataset <- dataset[complete.cases(dataset), ]
#To impute with the mean
# dataset <- dataset %>%
# mutate(TotalCharges = replace(TotalCharges,
# is.na(TotalCharges),mean(TotalCharges, na.rm = T)))Two things to prepare for the final model
For Cross-sectional data:
- 1. Train-Test Split
- 2. Cross-Validation (e.g. K-Fold Cross Validation)
library(caTools)
set.seed(10) # to ensure the reproducibility in the random numbers
# split <- createDataPartition(dataset$Churn, p=0.75, list=FALSE)
split = sample.split(dataset$Churn , SplitRatio = 0.8)
train <- dataset[split==TRUE,]
train_churn <- dataset[split==TRUE, "Churn"]
test <- dataset[split ==FALSE,]
test_churn <- dataset[split==TRUE, "Churn"]Always bear in mind the problem that you need to solve and start with a baseline model.
In our business case, it is easier to lose a client rather than to acquire a new client. Hence, it is important to analyze our models bearing this in mind.
lm_model-Logistic-
fit1 <- glm(formula = Churn~., data=train, family=binomial)
# fit1 <- glm(formula = Churn~ gender + partner + tenure, data=train, family=binomial)
pred_logistic <- predict(fit1, test, type="response")
y_pred <- ifelse(pred_logistic > 0.5, "Yes", "No")
confusionMatrix(table(test$Churn, y_pred), positive = "Yes")## Confusion Matrix and Statistics
##
## y_pred
## No Yes
## No 938 95
## Yes 160 214
##
## Accuracy : 0.8188
## 95% CI : (0.7976, 0.8386)
## No Information Rate : 0.7804
## P-Value [Acc > NIR] : 0.0002182
##
## Kappa : 0.5084
##
## Mcnemar's Test P-Value : 6.128e-05
##
## Sensitivity : 0.6926
## Specificity : 0.8543
## Pos Pred Value : 0.5722
## Neg Pred Value : 0.9080
## Prevalence : 0.2196
## Detection Rate : 0.1521
## Detection Prevalence : 0.2658
## Balanced Accuracy : 0.7734
##
## 'Positive' Class : Yes
##
The output of the predict function returns probabilities , now we need to set the threshold. One (common) possibility is to set the threshold in prob = 0.5. But we can directly try to set the threshold subject to the minimization of cost for a hypothetical company
Suppose that one month of our service is 100 euros. Then, the worst case scenario is that we lose a customer that we believe would not churn. Lets say that this customer cost us ten month worth of his payment due to hard competition in the industry. In addition, we would assume that we spend two month (200) of the customer fee per year only in retain the client.
type_error
What we will try to do is to set different threshold that will change the confusion matrix and then choose the one that minimizes the cost for our company.
th <- seq(0.1,1.0, length = 10)
total_cost = rep(0,length(th))
for (i in 1:length(th)){
pred = rep("No", length(pred_logistic))
pred[pred_logistic > th[i]] = "Yes"
pred <- as.factor(pred)
conf <- confusionMatrix(pred, test$Churn, positive = "Yes")
TN <- conf$table[1]
FP <- conf$table[2]
FN <- conf$table[3]
TP <- conf$table[4]
total_cost[i] = (FN*1000 + TP*200 + FP*200 + TN*0)/1000
}library(ggplot2)
library(plotly)
dt <- data.frame(th, total_cost)
my_chart <- ggplot(dt, aes(x = th, y = total_cost)) +
geom_line() +
geom_point() +
theme()
ggplotly(my_chart)Next Steps:
- 1. Deploy the model
- 2. Evaluate and Reporting
- 3. Documentation and OOP
- 4. Feature importance/Feature Selection analysis.
The logistic short post can be found in : SaedSayad
For more technical knowledge:
Introduction to Statistical Learning in R website
Data Science for business, Provost & Fawcett : Data Science for Business