One of the powerful features of R is the possibility to use data from Excel, CSV, SAS, SPSS, SQL, JSON, API, etc.
Two of the most important ones:
Reading from csv files:
Reading directly from “internet”
library(dplyr)
library(caret)
setwd("/Users/martinliendo/Documents/Proyectos/elective_meba")
path <- "https://raw.githubusercontent.com/dataoptimal/posts/master/business%20impact%20project/Telco%20Data.csv"
dataset <- read.csv(path)
glimpse(dataset)
## Observations: 7,043
## Variables: 21
## $ customerID <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 92...
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Fe...
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No,...
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No,...
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49...
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes...
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Ye...
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fibe...
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, ...
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No...
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No...
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No i...
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No ...
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No i...
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year...
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes,...
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank...
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 2...
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1...
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No...
So, we have 7043 observation with 21 variables with a mix of discrete and continous variables for each customer. Our dependent variable is "Churn" which is a binary (YES/NO) variable
summary(dataset)
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies Contract
## No :2810 No :2785 Month-to-month:3875
## No internet service:1526 No internet service:1526 One year :1473
## Yes :2707 Yes :2732 Two year :1695
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges
## No :2872 Bank transfer (automatic):1544 Min. : 18.25
## Yes:4171 Credit card (automatic) :1522 1st Qu.: 35.50
## Electronic check :2365 Median : 70.35
## Mailed check :1612 Mean : 64.76
## 3rd Qu.: 89.85
## Max. :118.75
##
## TotalCharges Churn
## Min. : 18.8 No :5174
## 1st Qu.: 401.4 Yes:1869
## Median :1397.5
## Mean :2283.3
## 3rd Qu.:3794.7
## Max. :8684.8
## NA's :11
Most of our variables are already a factor and not a character class. This is better for models. But one variable "Senior Citizen" is an integer with only two values. We transform this in a factor variable
dataset$SeniorCitizen <- as.factor(dataset$SeniorCitizen)
In a project of any size, data is likely to be incomplete , improperly coded/labeled data, etc. These values are represented by the symbol NA (not available) in R.
To detect them is na(X) returns a boolean (TRUE if the observation is missing)
sapply(dataset, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
table(dataset$Churn, is.na(dataset$TotalCharges))
##
## FALSE TRUE
## No 5163 11
## Yes 1869 0
Dealing with missing values:
Deleting missing observations (“listwise deletion”). Recommended when the number of missing data is particularly small. Check that there are still sufficient data points and not to introduce a bias by deleting.
Deleting the variable: when a variable has a large number of missing value and it doesn’t have a sound meaning for the project, then it can be removed.
Fill missing data with reasonable possibles values:
3.1 Median or mean imputation
3.2 Model Imputation, for example Knn Imputation
# We continue with the first way , we eliminate the rows with missing values
dataset <- dataset[complete.cases(dataset), ]
#To impute with the mean
# dataset <- dataset %>%
# mutate(TotalCharges = replace(TotalCharges,
# is.na(TotalCharges),mean(TotalCharges, na.rm = T)))
Two things to prepare for the final model
dataset <- dataset %>% select(-customerID, -TotalCharges)
library(caTools)
set.seed(10) # to ensure the reproducibility in the random numbers
# split <- createDataPartition(dataset$Churn, p=0.75, list=FALSE)
split = sample.split(dataset$Churn , SplitRatio = 0.8)
train <- dataset[split==TRUE,]
train_churn <- dataset[split==TRUE, "Churn"]
test <- dataset[split ==FALSE,]
test_churn <- dataset[split==TRUE, "Churn"]
Always bear in mind the problem that you need to solve. In today’s markets it is very common that it is easier to lose a client rather than to acquire a new client. Hence, it is important to analyze our existing clients.
What is the problem that we are trying to solve here?.
lm_model-Logistic-
fit1 <- glm(formula = Churn~., data=train, family=binomial)
pred_logistic <- predict(fit1, test, type="response")
y_pred <- ifelse(pred_logistic > 0.5, "Yes", "No")
confusionMatrix(test$Churn, y_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 938 95
## Yes 160 214
##
## Accuracy : 0.8188
## 95% CI : (0.7976, 0.8386)
## No Information Rate : 0.7804
## P-Value [Acc > NIR] : 0.0002182
##
## Kappa : 0.5084
## Mcnemar's Test P-Value : 6.128e-05
##
## Sensitivity : 0.6926
## Specificity : 0.8543
## Pos Pred Value : 0.5722
## Neg Pred Value : 0.9080
## Prevalence : 0.2196
## Detection Rate : 0.1521
## Detection Prevalence : 0.2658
## Balanced Accuracy : 0.7734
##
## 'Positive' Class : Yes
##
The output of the predict function returns probabilities , now we need to set the threshold. One (common) possibility is to set the threshold in prob = 0.5. But we can directly try to set the threshold subject to the minimization of cost for a hypothetical company
Suppose that one month of our service is 100 euros. Then, the worst case scenario is that we lose a customer that we believe would not churn. Lets say that this customer cost us ten month worth of his payment due to hard competition in the industry. In addition, we would assume that we spend two month (200) of the customer fee per year only in retain the client.
type_error
What we will try to do is to set different threshold that will change the confusion matrix and then choose the one that minimizes the cost for our company.
th <- seq(0.1,1.0, length = 10)
total_cost = rep(0,length(th))
for (i in 1:length(th)){
pred = rep("No", length(pred_logistic))
pred[pred_logistic > th[i]] = "Yes"
pred <- as.factor(pred)
conf <- confusionMatrix(pred, test$Churn, positive = "Yes")
TN <- conf$table[1]
FP <- conf$table[2]
FN <- conf$table[3]
TP <- conf$table[4]
total_cost[i] = (FN*1000 + TP*200 + FP*200 + TN*0)/1000
}
library(ggplot2)
library(plotly)
dt <- data.frame(th, total_cost)
my_chart <- ggplot(dt, aes(x = th, y = total_cost)) +
geom_line() +
geom_point() +
theme()
ggplotly(my_chart)
Let’s discuss the findings!!
The core of this class is base in this post:
TowardsDataScience
The logistic short post can be found in :
SaedSayad
For more technical knowledge:
Introduction to Statistical Learning in R website