Análisis Exploratorio de Datos

##Introducción Los servicios basados en suscripción generalmente hacen dinero de las siguientes tres maneras:

Adquirir nuevos clientes
Upsell de clientes
Retener clientes existentes

En este análisis me voy a centrar en el AED del dataset Telco Churn.

Preparación de los datos

El conjunto de datos con el que voy a trabajar se puede encontrar en el sitio web de IBM Watson Analytics: https://www.ibm.com/communities/analytics/watson-analytics-blog/guide-to-sample-datasets/ .

Este es un conjunto de datos relacionado a una empresa de telecomunicaciones. Podemos comenzar echando un vistazo a las dimensiones de los datos, así como a las diferentes variables.

Librerias

En primer lugar validaremos si las librerías que usaremos están instaladas y activadas para ser utilizadas en esta sesión.

if (!require("tidyverse"))
{ install.packages("tidyverse")
library(tidyverse)}

if (!require("miscset"))
{ install.packages("miscset")
library(miscset)}

if (!require("car"))
{ install.packages("car")
library(car)}

if (!require("Hmisc"))
{ install.packages("Hmisc")
library(Hmisc)}

Estructura de los datos

# Setting the working directory
path_loc <- "C:\\TEMPORAL\\6. ESTADÍSTICA\\3. R Studio"
setwd(path_loc)

# reading in the data
df <- read_csv("Telco.csv")

## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl  (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# dimensions of the data
dim_desc(df)

## [1] "[7,043 x 21]"

# names of the data
names(df)

##  [1] "customerID"       "gender"           "SeniorCitizen"    "Partner"         
##  [5] "Dependents"       "tenure"           "PhoneService"     "MultipleLines"   
##  [9] "InternetService"  "OnlineSecurity"   "OnlineBackup"     "DeviceProtection"
## [13] "TechSupport"      "StreamingTV"      "StreamingMovies"  "Contract"        
## [17] "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"   "TotalCharges"    
## [21] "Churn"

Echando un vistazo, vemos que hay 21 variables y 7043 observaciones. La variable de destino que intentaremos predecir es “Churn”. Podemos profundizar un poco más y echar un vistazo a los tipos de datos de las variables.

# data types
glimpse(df)

## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…

Conversión de variables

Por ahora, comencemos transformando las variables tipo “chr”, así como la variable “SeniorCitizen”, en tipo factor “fct”.

df <- df %>% mutate_if(is.character, as.factor)
df$SeniorCitizen <- as.factor(df$SeniorCitizen)
glimpse(df)

## Rows: 7,043
## Columns: 21
## $ customerID       <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-…
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…

Valores perdidos

Ahora busquemos valores perdidos.

sapply(df, function(x) sum(is.na(x)))

##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

df %>% map(~ sum(is.na(.)))

## $customerID
## [1] 0
## 
## $gender
## [1] 0
## 
## $SeniorCitizen
## [1] 0
## 
## $Partner
## [1] 0
## 
## $Dependents
## [1] 0
## 
## $tenure
## [1] 0
## 
## $PhoneService
## [1] 0
## 
## $MultipleLines
## [1] 0
## 
## $InternetService
## [1] 0
## 
## $OnlineSecurity
## [1] 0
## 
## $OnlineBackup
## [1] 0
## 
## $DeviceProtection
## [1] 0
## 
## $TechSupport
## [1] 0
## 
## $StreamingTV
## [1] 0
## 
## $StreamingMovies
## [1] 0
## 
## $Contract
## [1] 0
## 
## $PaperlessBilling
## [1] 0
## 
## $PaymentMethod
## [1] 0
## 
## $MonthlyCharges
## [1] 0
## 
## $TotalCharges
## [1] 11
## 
## $Churn
## [1] 0

Imputando variables

Parece que “TotalCharges” es la única característica con valores perdidos. Vamos a seguir e imputar los 11 valores perdidos utilizando el valor medio.

# imputing with the median
df <- df %>% 
  mutate(TotalCharges = replace(TotalCharges,
                                is.na(TotalCharges),
                                median(TotalCharges, na.rm = T)))

# checking that the imputation worked
sum(is.na(df$TotalCharges))

## [1] 0

Recodificando variables

summary(df)

##       customerID      gender     SeniorCitizen Partner    Dependents
##  0002-ORFBO:   1   Female:3488   0:5901        No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1:1142        Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                                                    
##  0011-IGKFF:   1                                                    
##  0013-EXCHZ:   1                                                    
##  0013-MHZWF:   1                                                    
##  (Other)   :7037                                                    
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2810   No                 :2785   Month-to-month:3875  
##  No internet service:1526   No internet service:1526   One year      :1473  
##  Yes                :2707   Yes                :2732   Two year      :1695  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling                   PaymentMethod  MonthlyCharges  
##  No :2872         Bank transfer (automatic):1544   Min.   : 18.25  
##  Yes:4171         Credit card (automatic)  :1522   1st Qu.: 35.50  
##                   Electronic check         :2365   Median : 70.35  
##                   Mailed check             :1612   Mean   : 64.76  
##                                                    3rd Qu.: 89.85  
##                                                    Max.   :118.75  
##                                                                    
##   TotalCharges    Churn     
##  Min.   :  18.8   No :5174  
##  1st Qu.: 402.2   Yes:1869  
##  Median :1397.5             
##  Mean   :2281.9             
##  3rd Qu.:3786.6             
##  Max.   :8684.8             
##

df$MultipleLines[df$MultipleLines=="No phone service"] <- "No"
df$OnlineSecurity[df$OnlineSecurity=="No internet service"] <- "No"
df$OnlineBackup[df$OnlineBackup=="No internet service"] <- "No"
df$DeviceProtection[df$DeviceProtection=="No internet service"] <- "No"
df$TechSupport[df$TechSupport=="No internet service"] <- "No"
df$StreamingTV[df$StreamingTV=="No internet service"] <- "No"
df$StreamingMovies[df$StreamingMovies=="No internet service"] <- "No"
describe(df)

## df 
## 
##  21  Variables      7043  Observations
## --------------------------------------------------------------------------------
## customerID 
##        n  missing distinct 
##     7043        0     7043 
## 
## lowest : 0002-ORFBO 0003-MKNFE 0004-TLHLJ 0011-IGKFF 0013-EXCHZ
## highest: 9987-LUTYD 9992-RRAMN 9992-UJOEL 9993-LHIEB 9995-HOTOH
## --------------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##     7043        0        2 
##                         
## Value      Female   Male
## Frequency    3488   3555
## Proportion  0.495  0.505
## --------------------------------------------------------------------------------
## SeniorCitizen 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value          0     1
## Frequency   5901  1142
## Proportion 0.838 0.162
## --------------------------------------------------------------------------------
## Partner 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   3641  3402
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## Dependents 
##        n  missing distinct 
##     7043        0        2 
##                     
## Value        No  Yes
## Frequency  4933 2110
## Proportion  0.7  0.3
## --------------------------------------------------------------------------------
## tenure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7043        0       73    0.999    32.37    28.08        1        2 
##      .25      .50      .75      .90      .95 
##        9       29       55       69       72 
## 
## lowest :  0  1  2  3  4, highest: 68 69 70 71 72
## --------------------------------------------------------------------------------
## PhoneService 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency    682  6361
## Proportion 0.097 0.903
## --------------------------------------------------------------------------------
## MultipleLines 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   4072  2971
## Proportion 0.578 0.422
## --------------------------------------------------------------------------------
## InternetService 
##        n  missing distinct 
##     7043        0        3 
##                                               
## Value              DSL Fiber optic          No
## Frequency         2421        3096        1526
## Proportion       0.344       0.440       0.217
## --------------------------------------------------------------------------------
## OnlineSecurity 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   5024  2019
## Proportion 0.713 0.287
## --------------------------------------------------------------------------------
## OnlineBackup 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   4614  2429
## Proportion 0.655 0.345
## --------------------------------------------------------------------------------
## DeviceProtection 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   4621  2422
## Proportion 0.656 0.344
## --------------------------------------------------------------------------------
## TechSupport 
##        n  missing distinct 
##     7043        0        2 
##                     
## Value        No  Yes
## Frequency  4999 2044
## Proportion 0.71 0.29
## --------------------------------------------------------------------------------
## StreamingTV 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   4336  2707
## Proportion 0.616 0.384
## --------------------------------------------------------------------------------
## StreamingMovies 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   4311  2732
## Proportion 0.612 0.388
## --------------------------------------------------------------------------------
## Contract 
##        n  missing distinct 
##     7043        0        3 
##                                                        
## Value      Month-to-month       One year       Two year
## Frequency            3875           1473           1695
## Proportion          0.550          0.209          0.241
## --------------------------------------------------------------------------------
## PaperlessBilling 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   2872  4171
## Proportion 0.408 0.592
## --------------------------------------------------------------------------------
## PaymentMethod 
##        n  missing distinct 
##     7043        0        4 
##                                                               
## Value      Bank transfer (automatic)   Credit card (automatic)
## Frequency                       1544                      1522
## Proportion                     0.219                     0.216
##                                                               
## Value               Electronic check              Mailed check
## Frequency                       2365                      1612
## Proportion                     0.336                     0.229
## --------------------------------------------------------------------------------
## MonthlyCharges 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7043        0     1585        1    64.76    34.39    19.65    20.05 
##      .25      .50      .75      .90      .95 
##    35.50    70.35    89.85   102.60   107.40 
## 
## lowest :  18.25  18.40  18.55  18.70  18.75, highest: 118.20 118.35 118.60 118.65 118.75
## --------------------------------------------------------------------------------
## TotalCharges 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7043        0     6531        1     2282     2447    49.65    84.61 
##      .25      .50      .75      .90      .95 
##   402.23  1397.47  3786.60  5973.69  6921.02 
## 
## lowest :   18.80   18.85   18.90   19.00   19.05
## highest: 8564.75 8594.40 8670.10 8672.45 8684.80
## --------------------------------------------------------------------------------
## Churn 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   5174  1869
## Proportion 0.735 0.265
## --------------------------------------------------------------------------------

Explorando los datos

Ahora que hemos importado los datos y realizado una limpieza, comencemos a explorar los datos. Comencemos por observar los valores únicos de las variables factor.

df_tbl <- df %>% 
  select_if(is.factor) %>% 
  summarise_all(n_distinct)


df_tbl[1:8] %>% 
  print(width = Inf)

## # A tibble: 1 × 8
##   customerID gender SeniorCitizen Partner Dependents PhoneService MultipleLines
##        <int>  <int>         <int>   <int>      <int>        <int>         <int>
## 1       7043      2             2       2          2            2             2
##   InternetService
##             <int>
## 1               3

df_tbl[9:15] %>% 
  print(width = Inf)

## # A tibble: 1 × 7
##   OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV
##            <int>        <int>            <int>       <int>       <int>
## 1              2            2                2           2           2
##   StreamingMovies Contract
##             <int>    <int>
## 1               2        3

df_tbl[16:18] %>% 
  print(width = Inf)

## # A tibble: 1 × 3
##   PaperlessBilling PaymentMethod Churn
##              <int>         <int> <int>
## 1                2             4     2

Hay un valor único para cada “ID de cliente”. Todos los demás factores tienen cuatro o menos valores únicos, por lo que serán bastante manejables.

Para guiar el análisis, intentaré responder las siguientes preguntas sobre mis segmentos de clientes:

¿Son los hombres más propensos a abandonar que las mujeres?
¿A los ciudadanos mayores les gusta más abandonar?
¿Las personas con pareja tienden a abandonar más que las que no tienen pareja?
¿Las personas con dependientes tienden a abandonar más que las personas que no tienen dependientes?

Comenzaré con el género. No esperaría que un género sea más propenso que otro a abandonar, pero veamos qué muestran los datos.

#gender
ggplot(data = df) +
  geom_bar(aes(x = gender, fill = gender))

ggplot(df) + 
  geom_bar(aes(x = gender, fill = Churn), position = "dodge")

Echando un vistazo, los resultados son similares. Aproximadamente una cuarta parte de los clientes masculinos abandonan, y aproximadamente una cuarta parte de los clientes femeninos abandonan. También podemos ver exactamente cuántas personas de cada género se removieron.

df %>% 
  group_by(gender) %>% 
  summarise(n=n()) %>%
  mutate(freq = n / sum(n))

## # A tibble: 2 × 3
##   gender     n  freq
##   <fct>  <int> <dbl>
## 1 Female  3488 0.495
## 2 Male    3555 0.505

df %>% 
  group_by(gender,Churn) %>% 
  summarise(n=n()) %>%
  mutate(freq = n / sum(n))

## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.

## # A tibble: 4 × 4
## # Groups:   gender [2]
##   gender Churn     n  freq
##   <fct>  <fct> <int> <dbl>
## 1 Female No     2549 0.731
## 2 Female Yes     939 0.269
## 3 Male   No     2625 0.738
## 4 Male   Yes     930 0.262

A continuación, echaré un vistazo a las personas mayores.

#SeniorCitizen
ggplot(data = df) +
  geom_bar(aes(x = SeniorCitizen, fill = SeniorCitizen))

ggplot(df) + 
  geom_bar(aes(x = SeniorCitizen, fill = Churn), position = "dodge")

df %>% 
  group_by(SeniorCitizen) %>% 
  summarise(n = n()) %>% 
  mutate(freq = n / sum(n))

## # A tibble: 2 × 3
##   SeniorCitizen     n  freq
##   <fct>         <int> <dbl>
## 1 0              5901 0.838
## 2 1              1142 0.162

df %>% 
  group_by(SeniorCitizen, Churn) %>% 
  summarise(n = n()) %>% 
  mutate(freq = n / sum(n))

## `summarise()` has grouped output by 'SeniorCitizen'. You can override using the
## `.groups` argument.

## # A tibble: 4 × 4
## # Groups:   SeniorCitizen [2]
##   SeniorCitizen Churn     n  freq
##   <fct>         <fct> <int> <dbl>
## 1 0             No     4508 0.764
## 2 0             Yes    1393 0.236
## 3 1             No      666 0.583
## 4 1             Yes     476 0.417

Esta variable muestra una relación mucho más significativa. Aproximadamente el 16% de los clientes son ciudadanos de la tercera edad, y aproximadamente el 42% de los ciudadanos mayores abandonan. Por otro lado, del 84% de los clientes que no son personas mayores, solo el 24% abandona. Estos resultados muestran que las personas mayores tienen muchas más probabilidades de abandonar.

Ahora voy a echar un vistazo a las personas con pareja.

#Partner
ggplot(data = df) +
  geom_bar(aes(x = Partner, fill = Partner))

ggplot(df) +
  geom_bar(aes(x=Partner, fill = Churn), position = "dodge")

df %>% 
  group_by(Partner) %>% 
  summarise(n = n()) %>% 
  mutate(freq = n / sum(n))

## # A tibble: 2 × 3
##   Partner     n  freq
##   <fct>   <int> <dbl>
## 1 No       3641 0.517
## 2 Yes      3402 0.483

df %>% 
  group_by(Partner, Churn) %>% 
  summarise(n = n()) %>% 
  mutate(freq = n / sum(n))

## `summarise()` has grouped output by 'Partner'. You can override using the
## `.groups` argument.

## # A tibble: 4 × 4
## # Groups:   Partner [2]
##   Partner Churn     n  freq
##   <fct>   <fct> <int> <dbl>
## 1 No      No     2441 0.670
## 2 No      Yes    1200 0.330
## 3 Yes     No     2733 0.803
## 4 Yes     Yes     669 0.197

Aproximadamente la mitad de las personas tienen pareja. De las personas con pareja, 20% abandonan. Para personas sin pareja, aproximadamente 33% abandonan.

A continuación, echaré un vistazo a la categoría Dependientes.

#Dependents
ggplot(data = df) +
  geom_bar(aes(x = Dependents, fill = Dependents))

ggplot(df) + 
  geom_bar(aes_string(x="Dependents", fill="Churn"), position = "dodge")

df %>% group_by(Dependents) %>% 
  summarise(n = n()) %>% 
  mutate(freq = n / sum(n))

## # A tibble: 2 × 3
##   Dependents     n  freq
##   <fct>      <int> <dbl>
## 1 No          4933 0.700
## 2 Yes         2110 0.300

df %>% group_by(Dependents, Churn) %>% 
  summarise(n=n()) %>% 
  mutate(freq = n / sum(n))

## `summarise()` has grouped output by 'Dependents'. You can override using the
## `.groups` argument.

## # A tibble: 4 × 4
## # Groups:   Dependents [2]
##   Dependents Churn     n  freq
##   <fct>      <fct> <int> <dbl>
## 1 No         No     3390 0.687
## 2 No         Yes    1543 0.313
## 3 Yes        No     1784 0.845
## 4 Yes        Yes     326 0.155

Aproximadamente el 30% de las personas tienen dependientes, de los cuales el 15% abandona. Para el otro 70% que no tiene dependientes, el 31% abandona.

Otra visualización útil es el diagrama de caja y bigotes. Esto nos da una imagen un poco más compacta de nuestros datos y nos ayuda a identificar valores atípicos. Echemos un vistazo a algunas gráficas de cajas y bigotes para conocer los cargos totales de los diferentes segmentos de clientes.

# Senior Citizens 
ggplot(df, aes(x = SeniorCitizen, y = TotalCharges)) + 
  geom_boxplot()

# Partner
ggplot(df, aes(x = Partner, y = TotalCharges)) + 
  geom_boxplot()

# Dependents
ggplot(df, aes(x = Dependents, y = TotalCharges)) + 
  geom_boxplot()

Después de ver estos resultados iniciales, podemos hacer algunas preguntas más. Es posible que queramos comparar los cargos totales de las personas mayores, las personas sin pareja y las personas sin dependientes.

Estos parecen ser los subconjuntos de personas con mayor probabilidad de abandono dentro de sus respectivos segmentos de clientes. Vamos a compararlos para que podamos identificar dónde podríamos centrar nuestros esfuerzos.

# Total charges and tenure of senior citizens
df %>% 
  select(SeniorCitizen, Churn, TotalCharges, tenure) %>% 
  filter(SeniorCitizen == 1, Churn == "Yes") %>% 
  summarise(n = n(),
            total = sum(TotalCharges),
            avg_tenure = sum(tenure)/n)

## # A tibble: 1 × 3
##       n   total avg_tenure
##   <int>   <dbl>      <dbl>
## 1   476 882405.       21.0

# Total charges and tenure of people without a partner
df %>% 
  select(Partner, Churn, TotalCharges, tenure) %>% 
  filter(Partner == "No", Churn == "Yes") %>% 
  summarise(n = n(),
            total = sum(TotalCharges),
            avg_tenure = sum(tenure)/n)

## # A tibble: 1 × 3
##       n    total avg_tenure
##   <int>    <dbl>      <dbl>
## 1  1200 1306776.       13.2

# Total charges and tenure of people without dependents
df %>% 
  select(Dependents, Churn, TotalCharges, tenure) %>% 
  filter(Dependents == "No", Churn == "Yes") %>% 
  summarise(n = n(),
            total = sum(TotalCharges),
            avg_tenure = sum(tenure)/n)

## # A tibble: 1 × 3
##       n    total avg_tenure
##   <int>    <dbl>      <dbl>
## 1  1543 2261840.       17.1

Aquí hay un resumen de los cargos totales (redondeando) para cada segmento de clientes que se produjo:

Segmento Clientes	Total Charges
Senior Citizens	900,000
No Partners	1,300,000
No Dependents	2,300,000

Con base en los resultados, debemos enfocar nuestros esfuerzos en personas sin dependientes. Este segmento de clientes que abandonó tenía casi 2.3MM en cargos totales en comparación con 1.3MM para personas sin pareja, y solo 900K para personas mayores.

Profundicemos un poco más y veamos qué servicios utiliza el segmento de clientes.

dependents <- df %>% filter(Dependents == "No")

ggplotGrid(ncol=2,
lapply(c("PhoneService","MultipleLines","InternetService","OnlineSecurity","OnlineBackup",
         "DeviceProtection"),
       function(col){
         ggplot(dependents,aes_string(col)) + geom_bar(aes(fill=Churn),position="dodge")
       }))

ggplotGrid(ncol=2,
lapply(c("TechSupport","StreamingTV","StreamingMovies","Contract",
         "PaperlessBilling"),
       function(col){
         ggplot(dependents,aes_string(col)) + geom_bar(aes(fill=Churn),position="dodge")
       }))

ggplot(dependents) +
  geom_bar(aes(x=PaymentMethod,fill=Churn), position = "dodge")

Echando un vistazo a los resultados, obtenemos algunas ideas potenciales:

Mucha gente con servicio telefónico abandonó. Tal vez estas personas realmente no usan el servicio telefónico. Moverlos a un plan sin servicio telefónico para ahorrarles algo de dinero en su factura podría ayudar a retenerlos.
Las personas con internet de fibra óptica abandonaron mucho más que las personas con DSL o sin internet. Quizás mudar a algunas de esas personas a DSL o eliminar su servicio de internet sería una opción. Otra opción podría ser algún tipo de reducción de precios en su plan de fibra óptica como una especie de promoción por ser un cliente fiel.
Personas sin servicio de respaldo en línea, protección de dispositivos y seguridad en línea abandonaron con bastante frecuencia. Tal vez sus dispositivos se han bloqueado, haciéndoles perder valiosos archivos. También pueden haber sufrido fraude o robo de identidad que los ha dejado muy descontentos. Mover a estas personas a algunos de estos servicios puede ayudar a proteger sus sistemas, evitando así muchos dolores de cabeza no deseados.
De manera similar a la copia de seguridad y seguridad en línea, aquellos sin protección del dispositivo tendieron a abandonar más que los que se suscribieron al servicio. Agregar protección del dispositivo a sus planes puede ser una buena manera de evitar la deserción.
Aquellos sin soporte técnico tienden a abandonar más frecuentemente que aquellos con soporte técnico. Mover a los clientes a cuentas de soporte técnico podría ser otra forma potencial de evitar la deserción.

Hay una serie de otras ideas diferentes que podríamos obtener de los datos, pero esta sería una buena lista inicial para investigar más a fondo si la empresa tuviera conjuntos de datos aún más detallados.

Analisis Adicionales

#TotalCharges
ggplot(data = df) +
  geom_histogram(mapping = aes(x = TotalCharges), binwidth = 100)

ggplot(data = df) +
  geom_histogram(mapping = aes(x = TotalCharges, fill = Churn), binwidth = 100)

ggplot(data = df, mapping = aes(x = TotalCharges, colour = Churn)) +
  geom_freqpoly(binwidth = 100)

df2 <- df %>% 
  filter(TotalCharges < 2500)
  
ggplot(data = df2, mapping = aes(x = TotalCharges)) +
  geom_histogram(binwidth = 100)

ggplot(df) + 
  geom_histogram(mapping = aes(x = TotalCharges), binwidth = 100) +
  coord_cartesian(ylim = c(0, 200), xlim = c(0, 2500))

#tenure
ggplot(data = df) +
  geom_histogram(mapping = aes(x = tenure), binwidth = 1)

ggplot(data = df) +
  geom_histogram(mapping = aes(x = tenure, fill = Churn), binwidth = 1)

ggplot(data = df, mapping = aes(x = tenure, colour = Churn)) +
  geom_freqpoly(binwidth = 1)

df3 <- df %>% 
  filter(tenure < 20)
  
ggplot(data = df3, mapping = aes(x = tenure)) +
  geom_histogram(binwidth = 1)

ggplot(df) + 
  geom_histogram(mapping = aes(x = tenure), binwidth = 1) +
  coord_cartesian(ylim = c(0, 200), xlim = c(0, 20))

#TotalCharges, tenure
ggplot(data = df) +
  geom_point(mapping = aes(x = tenure, y = TotalCharges), na.rm = TRUE)

ggplot(data = df) +
  geom_point(mapping = aes(x = tenure, y = TotalCharges), alpha = 1/10, na.rm = TRUE)

#TotalCharges, Contract
ggplot(data = df) +
  geom_boxplot(mapping = aes(x = reorder(Contract, TotalCharges, FUN = median), y = TotalCharges))

ggplot(data = df) +
  geom_boxplot(mapping = aes(x = reorder(Contract, TotalCharges, FUN = median), y = TotalCharges))+
  coord_flip()

#tenure, Contract
ggplot(data = df) +
  geom_boxplot(mapping = aes(x = reorder(Contract, tenure, FUN = median), y = tenure))

ggplot(data = df) +
  geom_boxplot(mapping = aes(x = reorder(Contract, tenure, FUN = median), y = tenure))+
  coord_flip()

#Contract, Churn
ggplot(data = df) +
  geom_count(mapping = aes(x = Contract, y = Churn))

df %>% 
  count(Contract, Churn) %>%  
  ggplot(mapping = aes(x = Contract, y = Churn)) +
    geom_tile(mapping = aes(fill = n))

Ahora que hemos hecho un análisis exploratorio básico, comencemos a hacer algunos modelos predictivos.

Análisis Exploratorio de Datos - Telco Data