Telco es una empresa internacional que vende una variedad de servicios de telecomunicaciones. Churn es un indicador de la deserción de clientes, fenómeno en el cual los clientes cancelan sus servicios con la empresa. El objetivo de este trabajo es predecir si un cliente abandonará o se quedará en la empresa.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Cargando paquete requerido: lattice
##
## Adjuntando el paquete: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
## customerID gender SeniorCitizen Partner
## Length:7043 Length:7043 Min. :0.0000 Length:7043
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1621
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Dependents tenure PhoneService MultipleLines
## Length:7043 Min. : 0.00 Length:7043 Length:7043
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7043 Length:7043 Min. : 18.25 Min. : 18.8
## Class :character Class :character 1st Qu.: 35.50 1st Qu.: 401.4
## Mode :character Mode :character Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
## Churn
## Length:7043
## Class :character
## Mode :character
##
##
##
##
La regresión logística en un modelo que se utiliza para predecir la probabilidad de que ocurra un evento, basado en una o más variables independientes. A diferencia de la regresión logística que predice valores continuos, la regrsión logística predecir resultados binarios o categóricos, como un Si/No, 1/0, verdadero/falso.
Si la probabilidad es mayor que 0.5, el modelo predice que el evento ocurrirá (Ej. el cliente se dará de baja). Si la probabilidad es menor que 0.5, predice que el evento no ocurrirá. En caso de ser 0.5, se conoce como umbral de decisión.
# Drop unecessary columns
cleaned_df <- df %>% select(-customerID)
# Convert categorical into factor
cleaned_df <- cleaned_df %>% mutate(across(where(is.character), as.factor))
str(cleaned_df)## 'data.frame': 7043 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
set.seed(123)
train_rows <- createDataPartition(cleaned_df$Churn, p = 0.8, list = FALSE)
train <- cleaned_df[train_rows, ]
test <- cleaned_df[-train_rows, ]##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.820e+00 9.035e-01 2.014 0.04398 *
## genderMale -3.227e-02 7.248e-02 -0.445 0.65620
## SeniorCitizen 2.636e-01 9.426e-02 2.797 0.00516 **
## PartnerYes -3.523e-02 8.694e-02 -0.405 0.68533
## DependentsYes -8.790e-02 1.008e-01 -0.872 0.38300
## tenure -6.307e-02 7.049e-03 -8.947 < 2e-16 ***
## PhoneServiceYes 6.129e-01 7.190e-01 0.852 0.39398
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 5.201e-01 1.965e-01 2.647 0.00811 **
## InternetServiceFiber optic 2.320e+00 8.861e-01 2.618 0.00884 **
## InternetServiceNo -2.354e+00 8.936e-01 -2.634 0.00844 **
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes -9.074e-02 1.977e-01 -0.459 0.64625
## OnlineBackupNo internet service NA NA NA NA
## OnlineBackupYes 1.376e-01 1.949e-01 0.706 0.48017
## DeviceProtectionNo internet service NA NA NA NA
## DeviceProtectionYes 2.800e-01 1.960e-01 1.428 0.15317
## TechSupportNo internet service NA NA NA NA
## TechSupportYes -2.619e-02 2.007e-01 -0.130 0.89619
## StreamingTVNo internet service NA NA NA NA
## StreamingTVYes 8.437e-01 3.614e-01 2.334 0.01958 *
## StreamingMoviesNo internet service NA NA NA NA
## StreamingMoviesYes 8.890e-01 3.630e-01 2.449 0.01433 *
## ContractOne year -6.819e-01 1.202e-01 -5.673 1.41e-08 ***
## ContractTwo year -1.450e+00 1.981e-01 -7.320 2.49e-13 ***
## PaperlessBillingYes 3.523e-01 8.316e-02 4.236 2.27e-05 ***
## PaymentMethodCredit card (automatic) 8.669e-04 1.266e-01 0.007 0.99454
## PaymentMethodElectronic check 2.757e-01 1.052e-01 2.620 0.00878 **
## PaymentMethodMailed check -1.337e-01 1.278e-01 -1.046 0.29562
## MonthlyCharges -6.431e-02 3.521e-02 -1.827 0.06775 .
## TotalCharges 3.522e-04 7.966e-05 4.421 9.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6517.2 on 5626 degrees of freedom
## Residual deviance: 4660.2 on 5603 degrees of freedom
## AIC: 4708.2
##
## Number of Fisher Scoring iterations: 6
# Confusion matrix for training data
train_result <- predict(model, train, type = 'response')
train_result_ <- ifelse(train_result >= 0.5, 'Yes', 'No')
train_confusion <- confusionMatrix(factor(train_result_, levels = c('Yes', 'No')), train$Churn)## Warning in confusionMatrix.default(factor(train_result_, levels = c("Yes", :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3707 672
## Yes 424 824
##
## Accuracy : 0.8052
## 95% CI : (0.7946, 0.8155)
## No Information Rate : 0.7341
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4732
##
## Mcnemar's Test P-Value : 8.593e-14
##
## Sensitivity : 0.8974
## Specificity : 0.5508
## Pos Pred Value : 0.8465
## Neg Pred Value : 0.6603
## Prevalence : 0.7341
## Detection Rate : 0.6588
## Detection Prevalence : 0.7782
## Balanced Accuracy : 0.7241
##
## 'Positive' Class : No
##
# Confusion matrix for test data
test_result <- predict(model, test, type= 'response')
test_result_ <- ifelse(test_result >= 0.5, 'Yes', 'No')
test_confusion <- confusionMatrix(factor(test_result_, levels = c('Yes', 'No')), test$Churn)## Warning in confusionMatrix.default(factor(test_result_, levels = c("Yes", :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 928 156
## Yes 104 217
##
## Accuracy : 0.8149
## 95% CI : (0.7936, 0.8349)
## No Information Rate : 0.7345
## P-Value [Acc > NIR] : 8.78e-13
##
## Kappa : 0.5034
##
## Mcnemar's Test P-Value : 0.001562
##
## Sensitivity : 0.8992
## Specificity : 0.5818
## Pos Pred Value : 0.8561
## Neg Pred Value : 0.6760
## Prevalence : 0.7345
## Detection Rate : 0.6605
## Detection Prevalence : 0.7715
## Balanced Accuracy : 0.7405
##
## 'Positive' Class : No
##
client_info <- cleaned_df[1, ]
client_info <- client_info %>% select(-Churn)
probabilities <- predict(model, client_info, type = 'response')
probabilities_cat <- ifelse(probabilities >= 0.5, 'Yes', 'No')
probabilities_cat## 1
## "Yes"
## 'data.frame': 7032 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
## ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...