Attrition Model - Retención de talento

Resumen

En una economía “sana”, una cierta cantidad de rotación voluntaria de empleados está dentro de los rangos normales. Las personas cambian de trabajo por muchas razones: familia, conveniencia, compensación, oportunidades de crecimiento y más. Sin embargo, siempre es bueno comprender la motivos detrás de la rotación de las personas para que se puedan tomar las medidas adecuadas y controlar fuga de talentos. Todos los días las organizaciones están generando nuevos datos sobre sus colaboradores. Pero ¿podría usar estos datos con mayor efectividad para la gestión de personas?

¿Por qué es importante la rotación? El problema asocia múltiples costos:

1. Costo de reclutamiento y selección.

2. Entrenamiento y formación.

3. Pérdida de conocimiento.

4. Impacto en las relaciones internas y con cliente

Debido a todas las implicancias que conduce, si una persona renuncia, se activan los costos de reclutamiento, selección y contratación, ya que debemos reemplazar a la gente para mantener el proceso trabajando. Una vez que, entre la nueva persona, se debe entrenar y capacitar para no perder la productividad esperada.

En este punto se han perdido las habilidades que ya había desarrollado la otra persona que dejó la empresa. Según estudios, las compañías invierten entre 6 a 9 sueldos en encontrar, reemplazar y capacitar a un nuevo trabajador.

1. Los empleados de nivel inicial cuestan del 30% al 50% de su salario anual.

2. Los empleados de nivel medio cuestan el 150% de su salario anual.

3. Los empleados especializados o de alto nivel cuestan hasta el 400% de su salario anual.

Con el objetivo de querer aportar en esta área algunas herramientas que permitan mejorar estos indicadores de gestión, les comparto un desarrollo de cómo implementar algunos algoritmos para la predicción de rotación de colaboradores y una implementación del algoritmo con 30 casos de la base original para evaluar el alcance predictivo.

El desarrollo no tiene la pretención de explicar descritivamente el comportamiento de las variables (el dataset es facilitado por IBM, Ver fuentes), ya que sabemos que cada empresa tiene realidades distintas. Sin embargo, el objetivo está puesto en cómo los avances en el aprendizaje automático y la ciencia de datos, pueden hacer posible no solo predecir la deserción de los empleados, sino también comprender las variables clave que influyen en la rotación.

Carga de librerías a utilizar

r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
paquetes <- c('data.table',
              'dplyr',
              'tidyr',
              'ggplot2',
              'randomForest',
              'ROCR',
              'purrr',
              'smbinning',
              'rpart',
              'rpart.plot',
              'skirm',
              'readxl',
              'rpart',
              'missForest',
              'xgboost',
              'adabag',
              'PerformanceAnalytics',
              'corrplot',
              'Boruta',
              'e1071',
              'kernlab',
              'grid',
              'Amelia',
              'nnet',
              'reshape2',
              'leaps',
              'foreign',
              'pPROC',
              'caret',
              'discretization',
              'DMwR',
              'corrgram'
)

instalados <- paquetes %in% installed.packages()

if(sum(instalados == FALSE) > 0) {
  install.packages(paquetes[!instalados])
}
lapply(paquetes,require,character.only = TRUE)

Desactiva la notación científica

options(scipen=999)#Desactiva la notación científica

Cargar data y revisión inicial

### Cargamos la base de datos
df <- fread('WA_Fn-UseC_-HR-Employee-Attrition.csv')

#Revisamos inicialmente la calidad de los datos.
str(df)

## Classes 'data.table' and 'data.frame':   1470 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : chr  "Yes" "No" "Yes" "No" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : chr  "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : chr  "Female" "Male" "Male" "Female" ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : chr  "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : chr  "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr  "Yes" "No" "Yes" "Yes" ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
##  - attr(*, ".internal.selfref")=<externalptr>

glimpse(df)

## Rows: 1,470
## Columns: 35
## $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35...
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No"...
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Tra...
## $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 13...
## $ Department               <chr> "Sales", "Research & Development", "Resear...
## $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 2...
## $ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, ...
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other",...
## $ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, ...
## $ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, ...
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male"...
## $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84...
## $ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, ...
## $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, ...
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "...
## $ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, ...
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", ...
## $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, ...
## $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864, 9...
## $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, ...
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Ye...
## $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13...
## $ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, ...
## $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, ...
## $ StandardHours            <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80...
## $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, ...
## $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5...
## $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, ...
## $ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, ...
## $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2,...
## $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, ...
## $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, ...
## $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, ...

Calidad de los datos:

#Editamos el dataset con el objetivo de categorizar bien las variables.
df <- df %>%
  mutate(Education = as.factor(if_else(Education == 1,"Below College", if_else(Education == 2, "College", if_else(Education == 3, "Bachelor", if_else(Education == 4, "Master","Doctor")))))
         ,EnvironmentSatisfaction = as.factor(if_else(EnvironmentSatisfaction == 1,"Low",if_else(EnvironmentSatisfaction == 2, "Medium", if_else(EnvironmentSatisfaction == 3, "High", "Very High"))))
         ,JobInvolvement = as.factor(if_else(JobInvolvement == 1,"Low",if_else(JobInvolvement == 2, "Medium",if_else(JobInvolvement == 3, "High", "Very High"))))
         ,JobSatisfaction = as.factor(if_else(JobSatisfaction == 1, "Low",if_else(JobSatisfaction == 2, "Medium",if_else(JobSatisfaction == 3, "High","Very High"))))
         ,PerformanceRating = as.factor(if_else(PerformanceRating == 1, "Low",if_else(PerformanceRating == 2, "Good", if_else(PerformanceRating == 3, "Excellent", "Outstanding"))))
         ,RelationshipSatisfaction = as.factor(if_else(RelationshipSatisfaction == 1, "Low",if_else(RelationshipSatisfaction == 2, "Medium", if_else(RelationshipSatisfaction == 3, "High", "Very High"))))
         ,WorkLifeBalance = as.factor(if_else(WorkLifeBalance == 1, "Bad",if_else(WorkLifeBalance == 2, "Good", if_else(WorkLifeBalance == 3, "Better", "Best"))))
         ,JobLevel = as.factor(JobLevel)
  ) %>%
  select(-EmployeeCount, -EmployeeNumber, -Over18, -StandardHours, -StockOptionLevel, -JobLevel)



#### Convertir las variables de clase a factor
df$Attrition = factor ( df$Attrition)
df$BusinessTravel = factor ( df$BusinessTravel)
df$EducationField = factor ( df$EducationField)
df$Department = factor ( df$Department)
df$Gender = factor ( df$Gender)
df$JobRole = factor ( df$JobRole)
df$MaritalStatus = factor ( df$MaritalStatus)
df$OverTime = factor ( df$OverTime)

#### Primer resumen de datos
summary(df)

##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department  DistanceFromHome         Education  
##  Human Resources       : 63   Min.   : 1.000   Bachelor     :572  
##  Research & Development:961   1st Qu.: 2.000   Below College:170  
##  Sales                 :446   Median : 7.000   College      :282  
##                               Mean   : 9.193   Doctor       : 48  
##                               3rd Qu.:14.000   Master       :398  
##                               Max.   :29.000                      
##                                                                   
##           EducationField EnvironmentSatisfaction    Gender      HourlyRate    
##  Human Resources : 27    High     :453           Female:588   Min.   : 30.00  
##  Life Sciences   :606    Low      :284           Male  :882   1st Qu.: 48.00  
##  Marketing       :159    Medium   :287                        Median : 66.00  
##  Medical         :464    Very High:446                        Mean   : 65.89  
##  Other           : 82                                         3rd Qu.: 83.75  
##  Technical Degree:132                                         Max.   :100.00  
##                                                                               
##    JobInvolvement                      JobRole     JobSatisfaction
##  High     :868    Sales Executive          :326   High     :442   
##  Low      : 83    Research Scientist       :292   Low      :289   
##  Medium   :375    Laboratory Technician    :259   Medium   :280   
##  Very High:144    Manufacturing Director   :145   Very High:459   
##                   Healthcare Representative:131                   
##                   Manager                  :102                   
##                   (Other)                  :215                   
##   MaritalStatus MonthlyIncome    MonthlyRate    NumCompaniesWorked OverTime  
##  Divorced:327   Min.   : 1009   Min.   : 2094   Min.   :0.000      No :1054  
##  Married :673   1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000      Yes: 416  
##  Single  :470   Median : 4919   Median :14236   Median :2.000                
##                 Mean   : 6503   Mean   :14313   Mean   :2.693                
##                 3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000                
##                 Max.   :19999   Max.   :26999   Max.   :9.000                
##                                                                              
##  PercentSalaryHike   PerformanceRating RelationshipSatisfaction
##  Min.   :11.00     Excellent  :1244    High     :459           
##  1st Qu.:12.00     Outstanding: 226    Low      :276           
##  Median :14.00                         Medium   :303           
##  Mean   :15.21                         Very High:432           
##  3rd Qu.:18.00                                                 
##  Max.   :25.00                                                 
##                                                                
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Bad   : 80      Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         Best  :153      1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Better:893      Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Good  :344      Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000                         3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000                         Max.   :40.000  
##                                                                          
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000      
##

str(df)

## 'data.frame':    1470 obs. of  29 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Factor w/ 5 levels "Bachelor","Below College",..: 3 2 3 5 2 3 1 2 1 1 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : Factor w/ 4 levels "High","Low","Medium",..: 3 1 4 4 2 4 1 4 4 1 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : Factor w/ 4 levels "High","Low","Medium",..: 1 3 3 1 1 1 4 1 3 1 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : Factor w/ 4 levels "High","Low","Medium",..: 4 3 1 1 3 4 2 1 1 1 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Factor w/ 2 levels "Excellent","Outstanding": 1 2 1 1 1 1 2 2 2 1 ...
##  $ RelationshipSatisfaction: Factor w/ 4 levels "High","Low","Medium",..: 2 4 3 1 4 1 2 3 3 3 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : Factor w/ 4 levels "Bad","Best","Better",..: 1 3 3 3 3 4 4 3 3 4 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

Calidad de datos: Análisis de nulos

data.frame(colSums(is.na(df)))

##                          colSums.is.na.df..
## Age                                       0
## Attrition                                 0
## BusinessTravel                            0
## DailyRate                                 0
## Department                                0
## DistanceFromHome                          0
## Education                                 0
## EducationField                            0
## EnvironmentSatisfaction                   0
## Gender                                    0
## HourlyRate                                0
## JobInvolvement                            0
## JobRole                                   0
## JobSatisfaction                           0
## MaritalStatus                             0
## MonthlyIncome                             0
## MonthlyRate                               0
## NumCompaniesWorked                        0
## OverTime                                  0
## PercentSalaryHike                         0
## PerformanceRating                         0
## RelationshipSatisfaction                  0
## TotalWorkingYears                         0
## TrainingTimesLastYear                     0
## WorkLifeBalance                           0
## YearsAtCompany                            0
## YearsInCurrentRole                        0
## YearsSinceLastPromotion                   0
## YearsWithCurrManager                      0

Calidad de datos: Análisis de atípicos

Analizamos las que son de tipo numérico

out <- function(variable){
  t(t(head(sort(variable,decreasing = T),20))) 
}
lapply(df,function(x){
  if(is.double(x)) out(x)
})

## $Age
## NULL
## 
## $Attrition
## NULL
## 
## $BusinessTravel
## NULL
## 
## $DailyRate
## NULL
## 
## $Department
## NULL
## 
## $DistanceFromHome
## NULL
## 
## $Education
## NULL
## 
## $EducationField
## NULL
## 
## $EnvironmentSatisfaction
## NULL
## 
## $Gender
## NULL
## 
## $HourlyRate
## NULL
## 
## $JobInvolvement
## NULL
## 
## $JobRole
## NULL
## 
## $JobSatisfaction
## NULL
## 
## $MaritalStatus
## NULL
## 
## $MonthlyIncome
## NULL
## 
## $MonthlyRate
## NULL
## 
## $NumCompaniesWorked
## NULL
## 
## $OverTime
## NULL
## 
## $PercentSalaryHike
## NULL
## 
## $PerformanceRating
## NULL
## 
## $RelationshipSatisfaction
## NULL
## 
## $TotalWorkingYears
## NULL
## 
## $TrainingTimesLastYear
## NULL
## 
## $WorkLifeBalance
## NULL
## 
## $YearsAtCompany
## NULL
## 
## $YearsInCurrentRole
## NULL
## 
## $YearsSinceLastPromotion
## NULL
## 
## $YearsWithCurrManager
## NULL

Analizamos las que son de tipo integer

Análisis longitudinal

longi <- df %>% 
  summarise_all(mean) %>% 
  t() %>% 
  as.data.frame() 
data.frame(variable = rownames(longi), media = longi$V1) %>% 
  arrange(desc(variable))

##                    variable        media
## 1      YearsWithCurrManager     4.123129
## 2   YearsSinceLastPromotion     2.187755
## 3        YearsInCurrentRole     4.229252
## 4            YearsAtCompany     7.008163
## 5           WorkLifeBalance           NA
## 6     TrainingTimesLastYear     2.799320
## 7         TotalWorkingYears    11.279592
## 8  RelationshipSatisfaction           NA
## 9         PerformanceRating           NA
## 10        PercentSalaryHike    15.209524
## 11                 OverTime           NA
## 12       NumCompaniesWorked     2.693197
## 13              MonthlyRate 14313.103401
## 14            MonthlyIncome  6502.931293
## 15            MaritalStatus           NA
## 16          JobSatisfaction           NA
## 17                  JobRole           NA
## 18           JobInvolvement           NA
## 19               HourlyRate    65.891156
## 20                   Gender           NA
## 21  EnvironmentSatisfaction           NA
## 22           EducationField           NA
## 23                Education           NA
## 24         DistanceFromHome     9.192517
## 25               Department           NA
## 26                DailyRate   802.485714
## 27           BusinessTravel           NA
## 28                Attrition           NA
## 29                      Age    36.923810

Análisis exploratorio de variables

df %>%
  select_if(is.integer) %>%
  gather() %>%
  ggplot(aes(value)) + geom_density() + facet_wrap(~key,scales='free') +
  theme(axis.text=element_text(size=6))

df %>%
  select_if(is.factor) %>%
  gather() %>%
  ggplot(aes(value)) + geom_bar() + facet_wrap(~key,scales='free') +
  theme(axis.text=element_text(size=6))

#Hacemos análisis de correlaciones
cor <- df %>%
  select_if(is.integer) %>%
  cor() %>% 
  round(digits = 2)

corrgram(cor,order=TRUE,lower.panel=panel.shade,upper.panel=panel.pie)

corrplot(cor, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

Terminar el procesamiento de los datos.

#Codificación de las variables categóricas
dmy <- dummyVars(~., data = df[-2])
trsf <- data.frame(predict(dmy, newdata = df[-2]))

#Revisión de asimetría de los datos y tratamiento de los mismos (Skewness)

trsf <- trsf %>%
  mutate(Age = log(Age + 1)
         ,DailyRate = log(DailyRate + 1)
         ,DistanceFromHome = log(DistanceFromHome + 1)
         ,HourlyRate = log(HourlyRate + 1)
         ,MonthlyIncome = log(MonthlyIncome + 1)
         ,MonthlyRate = log(MonthlyRate + 1)
         ,NumCompaniesWorked = log(NumCompaniesWorked + 1)
         ,PercentSalaryHike = log(PercentSalaryHike + 1)
         ,TotalWorkingYears = log(TotalWorkingYears + 1)
         ,TrainingTimesLastYear = log(TrainingTimesLastYear + 1)
         ,YearsAtCompany = log(YearsAtCompany +1)
         ,YearsInCurrentRole = log(YearsInCurrentRole + 1)
         ,YearsSinceLastPromotion = log(YearsSinceLastPromotion + 1)
         ,YearsWithCurrManager = log(YearsWithCurrManager + 1))

prep_num = preProcess(trsf, method=c("center", "scale"))
final_df = predict(prep_num, trsf)

# Mover las correlaciones más altas

cor_mat<- cor(final_df)
high_corr <- findCorrelation(cor_mat, cutoff = 0.85)
names(trsf)[high_corr]

## [1] "Department.Research...Development" "Department.Human.Resources"       
## [3] "PerformanceRating.Outstanding"     "Gender.Male"                      
## [5] "OverTime.Yes"

# Se establece el dataset final.
final_df <- cbind(trsf, df[2])

final_df <- final_df %>%
  mutate(Attrition = if_else(Attrition == "Yes",1,0)) %>%
  select(-Department.Research...Development,-Department.Human.Resources,-PerformanceRating.Outstanding,-Gender.Male,-OverTime.Yes)

# Para ver la proporción de los datos de mi variable a predecir. 
table(final_df$Attrition)

## 
##    0    1 
## 1233  237

Metodos de seleccion de variables (aplicamos boruta Boruta)

https://www.datacamp.com/community/tutorials/feature-selection-R-boruta

set.seed(1)
dfboruta <- Boruta(Attrition ~ ., data = final_df, doTrace = 2, ntree = 100)
print(dfboruta)

## Boruta performed 99 iterations in 11.37955 secs.
##  11 attributes confirmed important: Age,
## BusinessTravel.Travel_Frequently, JobInvolvement.Low,
## MaritalStatus.Single, MonthlyIncome and 6 more;
##  50 attributes confirmed unimportant: BusinessTravel.Non.Travel,
## BusinessTravel.Travel_Rarely, DailyRate, Department.Sales,
## DistanceFromHome and 45 more;
##  3 tentative attributes left: EnvironmentSatisfaction.Low,
## JobRole.Sales.Representative, WorkLifeBalance.Bad;

dfboruta2 <- TentativeRoughFix(dfboruta)
print(dfboruta2)

## Boruta performed 99 iterations in 11.37955 secs.
## Tentatives roughfixed over the last 99 iterations.
##  12 attributes confirmed important: Age,
## BusinessTravel.Travel_Frequently, JobInvolvement.Low,
## JobRole.Sales.Representative, MaritalStatus.Single and 7 more;
##  52 attributes confirmed unimportant: BusinessTravel.Non.Travel,
## BusinessTravel.Travel_Rarely, DailyRate, Department.Sales,
## DistanceFromHome and 47 more;

plot(dfboruta2, xlab = "", xaxt = "n")
lz<-lapply(1:ncol(dfboruta2$ImpHistory),function(i)
dfboruta2$ImpHistory[is.finite(dfboruta2$ImpHistory[,i]),i])
names(lz) <- colnames(dfboruta2$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(dfboruta2$ImpHistory), cex.axis = 0.7)

getSelectedAttributes(dfboruta2, withTentative = F)

##  [1] "Age"                              "BusinessTravel.Travel_Frequently"
##  [3] "JobInvolvement.Low"               "JobRole.Sales.Representative"    
##  [5] "MaritalStatus.Single"             "MonthlyIncome"                   
##  [7] "NumCompaniesWorked"               "OverTime.No"                     
##  [9] "TotalWorkingYears"                "YearsAtCompany"                  
## [11] "YearsInCurrentRole"               "YearsWithCurrManager"

Boruta

boruta3 <- attStats(dfboruta2)
print(boruta3)

##                                        meanImp    medianImp      minImp
## Age                                 6.45037383  6.530997282  1.84622231
## BusinessTravel.Non.Travel           0.66175851  0.689340492 -1.65137238
## BusinessTravel.Travel_Frequently    2.64130944  2.701833435 -0.05560411
## BusinessTravel.Travel_Rarely        0.13162355  0.003317546 -1.75592670
## DailyRate                           0.31154357  0.139747877 -2.03990911
## Department.Sales                    0.71244019  0.637815646 -2.16985707
## DistanceFromHome                    0.75770782  0.644062856 -1.67121743
## Education.Bachelor                 -0.25081264 -0.186592310 -1.64411056
## Education.Below.College             0.07047972  0.251581717 -1.54291262
## Education.College                  -0.85399705 -0.876644859 -1.61984938
## Education.Doctor                    0.17890872  0.269538218 -1.43705903
## Education.Master                    0.38311601  0.201395388 -2.18992597
## EducationField.Human.Resources     -0.23663384 -0.199497220 -0.98849682
## EducationField.Life.Sciences       -0.07838005 -0.363342721 -1.73405783
## EducationField.Marketing            1.10172919  0.934185018 -0.86706381
## EducationField.Medical              0.40665755  0.438838931 -1.85849159
## EducationField.Other               -0.32291143 -0.725545945 -1.33021541
## EducationField.Technical.Degree     0.30467688  0.390316597 -1.57782682
## EnvironmentSatisfaction.High        0.39959524  0.853423427 -2.06174510
## EnvironmentSatisfaction.Low         2.54926587  2.570597075 -0.70243688
## EnvironmentSatisfaction.Medium     -0.23622661 -0.538296025 -1.80623242
## EnvironmentSatisfaction.Very.High   0.23425753  0.082001356 -0.84509361
## Gender.Female                       0.34505141  0.561749480 -2.38442074
## HourlyRate                         -0.08715462 -0.173416900 -1.36022476
## JobInvolvement.High                 0.29150721  0.148906252 -1.34310413
## JobInvolvement.Low                  2.76977909  2.843587540 -0.59972271
## JobInvolvement.Medium               0.16016674  0.046437885 -1.86032844
## JobInvolvement.Very.High            0.44058967  0.288121572 -0.93015133
## JobRole.Healthcare.Representative   0.12018904  0.176806488 -1.07207094
## JobRole.Human.Resources             0.06862365  0.229307584 -2.26747727
## JobRole.Laboratory.Technician       1.24686688  1.148519545 -1.27316406
## JobRole.Manager                     0.47881201  0.446091151 -0.58402726
## JobRole.Manufacturing.Director      0.02995089 -0.308023953 -1.63986074
## JobRole.Research.Director           0.33274284  0.397837646 -1.11104768
## JobRole.Research.Scientist          0.82171704  0.777772881 -1.24504821
## JobRole.Sales.Executive             0.30630215  0.406670676 -1.26305038
## JobRole.Sales.Representative        2.23229485  2.208716340 -0.37922490
## JobSatisfaction.High               -0.21486263 -0.214842846 -1.81975481
## JobSatisfaction.Low                 0.71915220  1.093089606 -1.43247810
## JobSatisfaction.Medium             -0.13231433 -0.012759739 -2.51176460
## JobSatisfaction.Very.High           1.32078729  1.397888103 -0.62277519
## MaritalStatus.Divorced              1.17517862  1.323703831 -0.43073392
## MaritalStatus.Married               0.68131273  0.772785665 -1.41641949
## MaritalStatus.Single                3.73288312  3.782982007  1.35300903
## MonthlyIncome                       6.77735496  6.735880038  4.27654036
## MonthlyRate                         0.42109402  0.284385716 -0.75403603
## NumCompaniesWorked                  2.58440689  2.700131544 -1.01137499
## OverTime.No                        11.22438420 11.418767140  7.16549830
## PercentSalaryHike                   0.23137675  0.183057917 -1.64714211
## PerformanceRating.Excellent         0.17690256  0.151734821 -1.59279471
## RelationshipSatisfaction.High       0.44926549  0.334475196 -1.13472719
## RelationshipSatisfaction.Low        0.90230771  0.720613299 -0.99238267
## RelationshipSatisfaction.Medium    -0.38284937 -0.570205696 -2.25260177
## RelationshipSatisfaction.Very.High -0.11290756  0.070548747 -2.27802251
## TotalWorkingYears                   6.24915575  6.332090381  3.03471452
## TrainingTimesLastYear               0.13495252  0.109381597 -2.02196369
## WorkLifeBalance.Bad                 1.49437690  1.599646164 -1.20697132
## WorkLifeBalance.Best               -0.22480104 -0.336018421 -1.51770288
## WorkLifeBalance.Better              1.02708622  1.081089971 -1.23825949
## WorkLifeBalance.Good                0.73280866  0.656039742 -1.69121467
## YearsAtCompany                      4.66803502  4.759493182  1.80506490
## YearsInCurrentRole                  3.11357831  3.260915699 -0.01330695
## YearsSinceLastPromotion             0.82508820  0.782770163 -1.41339737
## YearsWithCurrManager                3.81419935  3.747453957  1.44246269
##                                        maxImp   normHits  decision
## Age                                10.9580559 0.98989899 Confirmed
## BusinessTravel.Non.Travel           2.2357841 0.00000000  Rejected
## BusinessTravel.Travel_Frequently    5.4780572 0.69696970 Confirmed
## BusinessTravel.Travel_Rarely        2.3066057 0.00000000  Rejected
## DailyRate                           2.7387658 0.03030303  Rejected
## Department.Sales                    2.3629386 0.00000000  Rejected
## DistanceFromHome                    2.7829193 0.05050505  Rejected
## Education.Bachelor                  1.2833262 0.00000000  Rejected
## Education.Below.College             1.2966277 0.00000000  Rejected
## Education.College                   0.1599436 0.00000000  Rejected
## Education.Doctor                    1.4792676 0.00000000  Rejected
## Education.Master                    1.7291497 0.00000000  Rejected
## EducationField.Human.Resources      1.0409924 0.00000000  Rejected
## EducationField.Life.Sciences        2.0717543 0.01010101  Rejected
## EducationField.Marketing            3.4414443 0.06060606  Rejected
## EducationField.Medical              2.3976384 0.01010101  Rejected
## EducationField.Other                1.6734537 0.00000000  Rejected
## EducationField.Technical.Degree     1.9875480 0.01010101  Rejected
## EnvironmentSatisfaction.High        2.1655299 0.01010101  Rejected
## EnvironmentSatisfaction.Low         6.3113315 0.66666667  Rejected
## EnvironmentSatisfaction.Medium      2.0659808 0.00000000  Rejected
## EnvironmentSatisfaction.Very.High   2.2082436 0.01010101  Rejected
## Gender.Female                       2.4979122 0.00000000  Rejected
## HourlyRate                          1.7804527 0.00000000  Rejected
## JobInvolvement.High                 1.9865370 0.00000000  Rejected
## JobInvolvement.Low                  5.5357794 0.75757576 Confirmed
## JobInvolvement.Medium               1.8879255 0.00000000  Rejected
## JobInvolvement.Very.High            2.3413774 0.01010101  Rejected
## JobRole.Healthcare.Representative   1.5884720 0.00000000  Rejected
## JobRole.Human.Resources             1.3174662 0.00000000  Rejected
## JobRole.Laboratory.Technician       3.3359541 0.29292929  Rejected
## JobRole.Manager                     1.9342300 0.00000000  Rejected
## JobRole.Manufacturing.Director      2.5301214 0.01010101  Rejected
## JobRole.Research.Director           2.0561523 0.01010101  Rejected
## JobRole.Research.Scientist          3.0490228 0.02020202  Rejected
## JobRole.Sales.Executive             1.9648835 0.00000000  Rejected
## JobRole.Sales.Representative        5.0535741 0.59595960 Confirmed
## JobSatisfaction.High                1.0224796 0.00000000  Rejected
## JobSatisfaction.Low                 1.6927892 0.00000000  Rejected
## JobSatisfaction.Medium              1.6460921 0.00000000  Rejected
## JobSatisfaction.Very.High           2.9631515 0.02020202  Rejected
## MaritalStatus.Divorced              2.6240354 0.03030303  Rejected
## MaritalStatus.Married               3.5704549 0.02020202  Rejected
## MaritalStatus.Single                5.7657113 0.93939394 Confirmed
## MonthlyIncome                       9.5610155 1.00000000 Confirmed
## MonthlyRate                         1.8250827 0.00000000  Rejected
## NumCompaniesWorked                  5.2414963 0.68686869 Confirmed
## OverTime.No                        15.8521886 1.00000000 Confirmed
## PercentSalaryHike                   3.1324619 0.02020202  Rejected
## PerformanceRating.Excellent         1.8313203 0.00000000  Rejected
## RelationshipSatisfaction.High       2.4846029 0.00000000  Rejected
## RelationshipSatisfaction.Low        3.4082527 0.07070707  Rejected
## RelationshipSatisfaction.Medium     2.2937267 0.01010101  Rejected
## RelationshipSatisfaction.Very.High  1.6397116 0.00000000  Rejected
## TotalWorkingYears                   8.4562704 1.00000000 Confirmed
## TrainingTimesLastYear               2.1098181 0.01010101  Rejected
## WorkLifeBalance.Bad                 4.1949965 0.32323232  Rejected
## WorkLifeBalance.Best                1.2113968 0.00000000  Rejected
## WorkLifeBalance.Better              3.3183128 0.04040404  Rejected
## WorkLifeBalance.Good                3.0005942 0.01010101  Rejected
## YearsAtCompany                      7.6981916 0.96969697 Confirmed
## YearsInCurrentRole                  6.5876411 0.80808081 Confirmed
## YearsSinceLastPromotion             2.5591235 0.02020202  Rejected
## YearsWithCurrManager                6.6005233 0.92929293 Confirmed

Selección de las variables finales para el modelo

final_df <- final_df %>%
  select(JobRole.Sales.Representative, NumCompaniesWorked, BusinessTravel.Travel_Frequently, JobInvolvement.Low, YearsInCurrentRole, YearsWithCurrManager, MaritalStatus.Single, YearsAtCompany, TotalWorkingYears, Age, MonthlyIncome, OverTime.No, Attrition)
final_df$Attrition = factor ( final_df$Attrition)

table(final_df$Attrition)

## 
##    0    1 
## 1233  237

Definición de dataset de testeo y entrenamiento

#Selección de dataset de entrenamiento y testeo

set.seed(1)
index <- createDataPartition(final_df[,1], p=0.75, list = FALSE)
train <- final_df[index,]
test <- final_df[-index,]

str(train)

## 'data.frame':    1103 obs. of  13 variables:
##  $ JobRole.Sales.Representative    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NumCompaniesWorked              : num  2.197 1.946 2.303 1.609 0.693 ...
##  $ BusinessTravel.Travel_Frequently: num  0 0 0 0 0 1 0 0 0 0 ...
##  $ JobInvolvement.Low              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ YearsInCurrentRole              : num  1.61 0 1.1 0 0 ...
##  $ YearsWithCurrManager            : num  1.79 0 1.1 0 0 ...
##  $ MaritalStatus.Single            : num  1 1 0 0 0 1 0 0 0 1 ...
##  $ YearsAtCompany                  : num  1.946 0 1.099 0.693 0.693 ...
##  $ TotalWorkingYears               : num  2.197 2.079 1.946 2.565 0.693 ...
##  $ Age                             : num  3.74 3.64 3.33 4.09 3.43 ...
##  $ MonthlyIncome                   : num  8.7 7.65 8.15 7.89 7.9 ...
##  $ OverTime.No                     : num  0 0 1 0 1 1 1 1 1 0 ...
##  $ Attrition                       : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 2 ...

prop.table(table(train$Attrition))

## 
##         0         1 
## 0.8404352 0.1595648

prop.table(table(test$Attrition))

## 
##         0         1 
## 0.8337875 0.1662125

Aplicación de algoritmos

Aplicación del algoritmo Random Forest

set.seed(1)
forest <- randomForest(Attrition ~ .,data=train,importance=TRUE)

#evaluar en conjunto de prueba--------------------------------------------
predictionsrf <- predict(forest, newdata = test)

#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionsrf,test$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 298  51
##          1   8  10
##                                           
##                Accuracy : 0.8392          
##                  95% CI : (0.7976, 0.8753)
##     No Information Rate : 0.8338          
##     P-Value [Acc > NIR] : 0.4226          
##                                           
##                   Kappa : 0.192           
##                                           
##  Mcnemar's Test P-Value : 0.00000004553   
##                                           
##             Sensitivity : 0.9739          
##             Specificity : 0.1639          
##          Pos Pred Value : 0.8539          
##          Neg Pred Value : 0.5556          
##              Prevalence : 0.8338          
##          Detection Rate : 0.8120          
##    Detection Prevalence : 0.9510          
##       Balanced Accuracy : 0.5689          
##                                           
##        'Positive' Class : 0               
##

importance(forest)

##                                          0         1 MeanDecreaseAccuracy
## JobRole.Sales.Representative      6.951697  4.112147             8.484580
## NumCompaniesWorked               10.884138  3.030282            11.406819
## BusinessTravel.Travel_Frequently  8.192549  4.713082             9.442856
## JobInvolvement.Low                9.075979  6.186014            10.712588
## YearsInCurrentRole               12.667339  8.919113            16.894636
## YearsWithCurrManager             10.270908  3.285121            12.490629
## MaritalStatus.Single              6.428801 11.840699            11.376491
## YearsAtCompany                   14.899006  2.497170            16.933074
## TotalWorkingYears                17.668647  1.815352            18.491793
## Age                              17.680327  7.886734            20.239531
## MonthlyIncome                    14.950453  6.587204            18.089071
## OverTime.No                      14.875886 22.511325            23.040566
##                                  MeanDecreaseGini
## JobRole.Sales.Representative             5.327672
## NumCompaniesWorked                      25.096104
## BusinessTravel.Travel_Frequently        10.860923
## JobInvolvement.Low                       7.347062
## YearsInCurrentRole                      18.673823
## YearsWithCurrManager                    20.081600
## MaritalStatus.Single                    12.632842
## YearsAtCompany                          26.679911
## TotalWorkingYears                       35.302102
## Age                                     40.490647
## MonthlyIncome                           54.851854
## OverTime.No                             19.300534

ROC AUC para Random Forest

set.seed(1)
ROCRpred <- prediction(as.numeric(predictionsrf), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_rf <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_rf, digits=5, scientific=FALSE)))

Aplicación del algoritmo Decision tree

set.seed(1)
arbol <- rpart(Attrition ~ ., data=train,method = "class") ### En Method se tiene que especificar, porque por defecto deja Gini ###
plot(arbol,uniform=T,margin=0.2)
text (arbol, use.n = T, pretty = TRUE)
title("Training Set's Classification Tree")

#evaluar en conjunto de prueba---------------------------------------------
predictionsar <- predict(arbol, test, type="class")

#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionsar,test$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 299  50
##          1   7  11
##                                           
##                Accuracy : 0.8447          
##                  95% CI : (0.8035, 0.8802)
##     No Information Rate : 0.8338          
##     P-Value [Acc > NIR] : 0.3161          
##                                           
##                   Kappa : 0.2194          
##                                           
##  Mcnemar's Test P-Value : 0.00000002651   
##                                           
##             Sensitivity : 0.9771          
##             Specificity : 0.1803          
##          Pos Pred Value : 0.8567          
##          Neg Pred Value : 0.6111          
##              Prevalence : 0.8338          
##          Detection Rate : 0.8147          
##    Detection Prevalence : 0.9510          
##       Balanced Accuracy : 0.5787          
##                                           
##        'Positive' Class : 0               
##

ROC AUC para Decision tree

set.seed(1)
plotcp(arbol)

ROCRpred <- prediction(as.numeric(predictionsar), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_dt <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_dt, digits=5, scientific=FALSE)))

Aplicación del algoritmo Support vector machine

set.seed(1)
svm <- ksvm(Attrition ~ ., data = train, kernel = "vanilladot", cross=10)

##  Setting default kernel parameters

#evaluar en conjunto de prueba---------------------------------------------
predictionssvm <- predict(svm, test)

#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionssvm,test$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 306  61
##          1   0   0
##                                              
##                Accuracy : 0.8338             
##                  95% CI : (0.7917, 0.8704)   
##     No Information Rate : 0.8338             
##     P-Value [Acc > NIR] : 0.5341             
##                                              
##                   Kappa : 0                  
##                                              
##  Mcnemar's Test P-Value : 0.00000000000001564
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.0000             
##          Pos Pred Value : 0.8338             
##          Neg Pred Value :    NaN             
##              Prevalence : 0.8338             
##          Detection Rate : 0.8338             
##    Detection Prevalence : 1.0000             
##       Balanced Accuracy : 0.5000             
##                                              
##        'Positive' Class : 0                  
##

ROC AUC para SVM

set.seed(1)
ROCRpred <- prediction(as.numeric(predictionssvm), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_rf <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_rf, digits=5, scientific=FALSE)))

Aplicación del algoritmo Xtreme Gradient Boosting

set.seed(1)
control <- trainControl(method="repeatedcv", number=5)
xgb <- train(Attrition~., data=train, method="xgbTree",  trControl=control)

#evaluar en conjunto de prueba---------------------------------------------
predictionsxgb <- predict(xgb, test)

#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionsxgb,test$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 296  45
##          1  10  16
##                                           
##                Accuracy : 0.8501          
##                  95% CI : (0.8094, 0.8851)
##     No Information Rate : 0.8338          
##     P-Value [Acc > NIR] : 0.2222          
##                                           
##                   Kappa : 0.2981          
##                                           
##  Mcnemar's Test P-Value : 0.000004549     
##                                           
##             Sensitivity : 0.9673          
##             Specificity : 0.2623          
##          Pos Pred Value : 0.8680          
##          Neg Pred Value : 0.6154          
##              Prevalence : 0.8338          
##          Detection Rate : 0.8065          
##    Detection Prevalence : 0.9292          
##       Balanced Accuracy : 0.6148          
##                                           
##        'Positive' Class : 0               
##

ROC AUC para XGB

set.seed(1)
ROCRpred <- prediction(as.numeric(predictionsxgb), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_xgb <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_xgb, digits=5, scientific=FALSE)))

Aplicación del algoritmo Regresión Logística

set.seed(1)
control <- trainControl(method="repeatedcv", number=5)
lr <- train(Attrition~., data=train, method="glm", trControl=control)

#evaluar en conjunto de prueba---------------------------------------------
predictionslr <- predict(lr, test)

#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionslr,test$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 301  48
##          1   5  13
##                                           
##                Accuracy : 0.8556          
##                  95% CI : (0.8154, 0.8899)
##     No Information Rate : 0.8338          
##     P-Value [Acc > NIR] : 0.146           
##                                           
##                   Kappa : 0.2741          
##                                           
##  Mcnemar's Test P-Value : 0.000000007968  
##                                           
##             Sensitivity : 0.9837          
##             Specificity : 0.2131          
##          Pos Pred Value : 0.8625          
##          Neg Pred Value : 0.7222          
##              Prevalence : 0.8338          
##          Detection Rate : 0.8202          
##    Detection Prevalence : 0.9510          
##       Balanced Accuracy : 0.5984          
##                                           
##        'Positive' Class : 0               
##

ROC AUC para la Regresión logística

set.seed(1)
ROCRpred <- prediction(as.numeric(predictionslr), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_lr <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_lr, digits=5, scientific=FALSE)))

Comparación de modelos

# Tabulating accuracies
Model <- c('Random Forest','Decision tree','SVM-vanilladot', 'XGB', 'Logistic Regresion')
Accuracy <- c(83.92,84.47,83.38,85.01,85.56)
AUC <- c(56.89,57.87,50,61.48,59.83)
mytable<- data.frame(Model, Accuracy, AUC)
head(mytable)

##                Model Accuracy   AUC
## 1      Random Forest    83.92 56.89
## 2      Decision tree    84.47 57.87
## 3     SVM-vanilladot    83.38 50.00
## 4                XGB    85.01 61.48
## 5 Logistic Regresion    85.56 59.83

Implementación y prueba

set.seed(1)
implementa <- fread('implementación2020.csv') ### cargar el nuevo dataset

#Editamos el dataset con el objetivo de categorizar bien las variables.
implementa <- implementa %>%
  mutate(Education = as.factor(if_else(Education == 1,"Below College", if_else(Education == 2, "College", if_else(Education == 3, "Bachelor", if_else(Education == 4, "Master","Doctor")))))
         ,EnvironmentSatisfaction = as.factor(if_else(EnvironmentSatisfaction == 1,"Low",if_else(EnvironmentSatisfaction == 2, "Medium", if_else(EnvironmentSatisfaction == 3, "High", "Very High"))))
         ,JobInvolvement = as.factor(if_else(JobInvolvement == 1,"Low",if_else(JobInvolvement == 2, "Medium",if_else(JobInvolvement == 3, "High", "Very High"))))
         ,JobSatisfaction = as.factor(if_else(JobSatisfaction == 1, "Low",if_else(JobSatisfaction == 2, "Medium",if_else(JobSatisfaction == 3, "High","Very High"))))
         ,PerformanceRating = as.factor(if_else(PerformanceRating == 1, "Low",if_else(PerformanceRating == 2, "Good", if_else(PerformanceRating == 3, "Excellent", "Outstanding"))))
         ,RelationshipSatisfaction = as.factor(if_else(RelationshipSatisfaction == 1, "Low",if_else(RelationshipSatisfaction == 2, "Medium", if_else(RelationshipSatisfaction == 3, "High", "Very High"))))
         ,WorkLifeBalance = as.factor(if_else(WorkLifeBalance == 1, "Bad",if_else(WorkLifeBalance == 2, "Good", if_else(WorkLifeBalance == 3, "Better", "Best"))))
         ,JobLevel = as.factor(JobLevel)
  ) %>%
  select(-EmployeeCount, -EmployeeNumber, -Over18, -StandardHours, -StockOptionLevel, -JobLevel)

#### Convertir las variables de clase a factor
implementa$Attrition = factor ( implementa$Attrition)
implementa$BusinessTravel = factor ( implementa$BusinessTravel)
implementa$EducationField = factor ( implementa$EducationField)
implementa$Department = factor ( implementa$Department)
implementa$Gender = factor ( implementa$Gender)
implementa$JobRole = factor ( implementa$JobRole)
implementa$MaritalStatus = factor ( implementa$MaritalStatus)
implementa$OverTime = factor ( implementa$OverTime)

#### Primer resumen de datos
summary(df)

##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department  DistanceFromHome         Education  
##  Human Resources       : 63   Min.   : 1.000   Bachelor     :572  
##  Research & Development:961   1st Qu.: 2.000   Below College:170  
##  Sales                 :446   Median : 7.000   College      :282  
##                               Mean   : 9.193   Doctor       : 48  
##                               3rd Qu.:14.000   Master       :398  
##                               Max.   :29.000                      
##                                                                   
##           EducationField EnvironmentSatisfaction    Gender      HourlyRate    
##  Human Resources : 27    High     :453           Female:588   Min.   : 30.00  
##  Life Sciences   :606    Low      :284           Male  :882   1st Qu.: 48.00  
##  Marketing       :159    Medium   :287                        Median : 66.00  
##  Medical         :464    Very High:446                        Mean   : 65.89  
##  Other           : 82                                         3rd Qu.: 83.75  
##  Technical Degree:132                                         Max.   :100.00  
##                                                                               
##    JobInvolvement                      JobRole     JobSatisfaction
##  High     :868    Sales Executive          :326   High     :442   
##  Low      : 83    Research Scientist       :292   Low      :289   
##  Medium   :375    Laboratory Technician    :259   Medium   :280   
##  Very High:144    Manufacturing Director   :145   Very High:459   
##                   Healthcare Representative:131                   
##                   Manager                  :102                   
##                   (Other)                  :215                   
##   MaritalStatus MonthlyIncome    MonthlyRate    NumCompaniesWorked OverTime  
##  Divorced:327   Min.   : 1009   Min.   : 2094   Min.   :0.000      No :1054  
##  Married :673   1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000      Yes: 416  
##  Single  :470   Median : 4919   Median :14236   Median :2.000                
##                 Mean   : 6503   Mean   :14313   Mean   :2.693                
##                 3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000                
##                 Max.   :19999   Max.   :26999   Max.   :9.000                
##                                                                              
##  PercentSalaryHike   PerformanceRating RelationshipSatisfaction
##  Min.   :11.00     Excellent  :1244    High     :459           
##  1st Qu.:12.00     Outstanding: 226    Low      :276           
##  Median :14.00                         Medium   :303           
##  Mean   :15.21                         Very High:432           
##  3rd Qu.:18.00                                                 
##  Max.   :25.00                                                 
##                                                                
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Bad   : 80      Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         Best  :153      1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Better:893      Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Good  :344      Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000                         3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000                         Max.   :40.000  
##                                                                          
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000      
##

str(df)

## 'data.frame':    1470 obs. of  29 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Factor w/ 5 levels "Bachelor","Below College",..: 3 2 3 5 2 3 1 2 1 1 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : Factor w/ 4 levels "High","Low","Medium",..: 3 1 4 4 2 4 1 4 4 1 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : Factor w/ 4 levels "High","Low","Medium",..: 1 3 3 1 1 1 4 1 3 1 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : Factor w/ 4 levels "High","Low","Medium",..: 4 3 1 1 3 4 2 1 1 1 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Factor w/ 2 levels "Excellent","Outstanding": 1 2 1 1 1 1 2 2 2 1 ...
##  $ RelationshipSatisfaction: Factor w/ 4 levels "High","Low","Medium",..: 2 4 3 1 4 1 2 3 3 3 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : Factor w/ 4 levels "Bad","Best","Better",..: 1 3 3 3 3 4 4 3 3 4 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

#Codificación de variables categóricas
dmy2 <- dummyVars(~., data = implementa[-2])
trsf2 <- data.frame(predict(dmy2, newdata = implementa[-2]))

#Removing Skewness

trsf2 <- trsf2%>%
  mutate(Age = log(Age + 1)
         ,DailyRate = log(DailyRate + 1)
         ,DistanceFromHome = log(DistanceFromHome + 1)
         ,HourlyRate = log(HourlyRate + 1)
         ,MonthlyIncome = log(MonthlyIncome + 1)
         ,MonthlyRate = log(MonthlyRate + 1)
         ,NumCompaniesWorked = log(NumCompaniesWorked + 1)
         ,PercentSalaryHike = log(PercentSalaryHike + 1)
         ,TotalWorkingYears = log(TotalWorkingYears + 1)
         ,TrainingTimesLastYear = log(TrainingTimesLastYear + 1)
         ,YearsAtCompany = log(YearsAtCompany +1)
         ,YearsInCurrentRole = log(YearsInCurrentRole + 1)
         ,YearsSinceLastPromotion = log(YearsSinceLastPromotion + 1)
         ,YearsWithCurrManager = log(YearsWithCurrManager + 1))

prep_num2 = preProcess(trsf2, method=c("center", "scale"))
final_implementa = predict(prep_num2, trsf2)

# Mover las correlaciones más altas

cor_mat<- cor(final_implementa)
high_corr <- findCorrelation(cor_mat, cutoff = 0.85)
names(trsf)[high_corr]

##  [1] "WorkLifeBalance.Bad"               "WorkLifeBalance.Best"             
##  [3] "JobSatisfaction.Low"               "NumCompaniesWorked"               
##  [5] "BusinessTravel.Non.Travel"         "DailyRate"                        
##  [7] "JobInvolvement.Medium"             "JobSatisfaction.High"             
##  [9] "EnvironmentSatisfaction.Very.High" "Department.Human.Resources"       
## [11] "MaritalStatus.Single"

# Se establece el dataset final.
final_implementa <- cbind(trsf, df[2])

final_implementa <- final_implementa %>%
  mutate(Attrition = if_else(Attrition == "Yes",1,0)) %>%
  select(-Department.Research...Development,-Department.Human.Resources,-PerformanceRating.Outstanding,-Gender.Male,-OverTime.Yes)

# Para ver la proporción de los datos de mi variable a predecir. 
table(final_df$Attrition)

## 
##    0    1 
## 1233  237

# Selección de variables


final_implementa <- final_implementa %>%
  select(JobRole.Sales.Representative, NumCompaniesWorked, BusinessTravel.Travel_Frequently, JobInvolvement.Low, YearsInCurrentRole, YearsWithCurrManager, MaritalStatus.Single, YearsAtCompany, TotalWorkingYears, Age, MonthlyIncome, OverTime.No, Attrition)
final_implementa$Attrition = factor ( final_implementa$Attrition)


#Predecir con 30 datos aleatoreos.Todos verdaderos positivos, es decir, que dejaron la compañía. 
#Vamos a utilizar el modelo Random Forest para ver el poder predictivo del mismo. 

set.seed(1)
final_implementa$Predicted<-predict(lr,final_implementa)

resultado <- subset(final_implementa, Predicted == 1)

head(resultado)

##     JobRole.Sales.Representative NumCompaniesWorked
## 3                              0          1.9459101
## 15                             0          1.7917595
## 27                             0          0.6931472
## 43                             0          0.6931472
## 103                            0          0.6931472
## 112                            0          0.6931472
##     BusinessTravel.Travel_Frequently JobInvolvement.Low YearsInCurrentRole
## 3                                  0                  0           0.000000
## 15                                 0                  0           1.098612
## 27                                 1                  1           1.098612
## 43                                 0                  1           0.000000
## 103                                1                  0           0.000000
## 112                                1                  1           2.079442
##     YearsWithCurrManager MaritalStatus.Single YearsAtCompany TotalWorkingYears
## 3              0.0000000                    1      0.0000000         2.0794415
## 15             1.3862944                    1      1.6094379         1.9459101
## 27             2.0794415                    1      2.3978953         2.3978953
## 43             0.6931472                    1      0.6931472         0.6931472
## 103            0.0000000                    1      0.6931472         0.6931472
## 112            1.9459101                    1      2.3025851         2.3025851
##          Age MonthlyIncome OverTime.No Attrition Predicted
## 3   3.637586      7.645398           0         1         1
## 15  3.367296      7.615298           0         1         1
## 27  3.496508      8.273847           0         1         1
## 43  3.295837      7.738052           1         1         1
## 103 3.044522      7.981733           0         1         1
## 112 3.555348      8.711937           0         1         1

Conclusión:

Se utilizó la regresión logístca, ya que tiene uno de los dos mejores Accuracy más su desempeño en los indicadores AUC- ROC. Al implementarlo con 30 casos aleatorios originales de la base..