En una economía “sana”, una cierta cantidad de rotación voluntaria de empleados está dentro de los rangos normales. Las personas cambian de trabajo por muchas razones: familia, conveniencia, compensación, oportunidades de crecimiento y más. Sin embargo, siempre es bueno comprender la motivos detrás de la rotación de las personas para que se puedan tomar las medidas adecuadas y controlar fuga de talentos. Todos los días las organizaciones están generando nuevos datos sobre sus colaboradores. Pero ¿podría usar estos datos con mayor efectividad para la gestión de personas?
¿Por qué es importante la rotación? El problema asocia múltiples costos:
Debido a todas las implicancias que conduce, si una persona renuncia, se activan los costos de reclutamiento, selección y contratación, ya que debemos reemplazar a la gente para mantener el proceso trabajando. Una vez que, entre la nueva persona, se debe entrenar y capacitar para no perder la productividad esperada.
En este punto se han perdido las habilidades que ya había desarrollado la otra persona que dejó la empresa. Según estudios, las compañías invierten entre 6 a 9 sueldos en encontrar, reemplazar y capacitar a un nuevo trabajador.
Con el objetivo de querer aportar en esta área algunas herramientas que permitan mejorar estos indicadores de gestión, les comparto un desarrollo de cómo implementar algunos algoritmos para la predicción de rotación de colaboradores y una implementación del algoritmo con 30 casos de la base original para evaluar el alcance predictivo.
El desarrollo no tiene la pretención de explicar descritivamente el comportamiento de las variables (el dataset es facilitado por IBM, Ver fuentes), ya que sabemos que cada empresa tiene realidades distintas. Sin embargo, el objetivo está puesto en cómo los avances en el aprendizaje automático y la ciencia de datos, pueden hacer posible no solo predecir la deserción de los empleados, sino también comprender las variables clave que influyen en la rotación.
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
paquetes <- c('data.table',
'dplyr',
'tidyr',
'ggplot2',
'randomForest',
'ROCR',
'purrr',
'smbinning',
'rpart',
'rpart.plot',
'skirm',
'readxl',
'rpart',
'missForest',
'xgboost',
'adabag',
'PerformanceAnalytics',
'corrplot',
'Boruta',
'e1071',
'kernlab',
'grid',
'Amelia',
'nnet',
'reshape2',
'leaps',
'foreign',
'pPROC',
'caret',
'discretization',
'DMwR',
'corrgram'
)
instalados <- paquetes %in% installed.packages()
if(sum(instalados == FALSE) > 0) {
install.packages(paquetes[!instalados])
}
lapply(paquetes,require,character.only = TRUE)
options(scipen=999)#Desactiva la notación científica
### Cargamos la base de datos
df <- fread('WA_Fn-UseC_-HR-Employee-Attrition.csv')
#Revisamos inicialmente la calidad de los datos.
str(df)
## Classes 'data.table' and 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
## - attr(*, ".internal.selfref")=<externalptr>
glimpse(df)
## Rows: 1,470
## Columns: 35
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35...
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No"...
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Tra...
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 13...
## $ Department <chr> "Sales", "Research & Development", "Resear...
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 2...
## $ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, ...
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other",...
## $ EmployeeCount <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ EmployeeNumber <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, ...
## $ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, ...
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male"...
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84...
## $ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, ...
## $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, ...
## $ JobRole <chr> "Sales Executive", "Research Scientist", "...
## $ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, ...
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", ...
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, ...
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9...
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, ...
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Ye...
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13...
## $ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, ...
## $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, ...
## $ StandardHours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80...
## $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, ...
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5...
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, ...
## $ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, ...
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2,...
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, ...
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, ...
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, ...
#Editamos el dataset con el objetivo de categorizar bien las variables.
df <- df %>%
mutate(Education = as.factor(if_else(Education == 1,"Below College", if_else(Education == 2, "College", if_else(Education == 3, "Bachelor", if_else(Education == 4, "Master","Doctor")))))
,EnvironmentSatisfaction = as.factor(if_else(EnvironmentSatisfaction == 1,"Low",if_else(EnvironmentSatisfaction == 2, "Medium", if_else(EnvironmentSatisfaction == 3, "High", "Very High"))))
,JobInvolvement = as.factor(if_else(JobInvolvement == 1,"Low",if_else(JobInvolvement == 2, "Medium",if_else(JobInvolvement == 3, "High", "Very High"))))
,JobSatisfaction = as.factor(if_else(JobSatisfaction == 1, "Low",if_else(JobSatisfaction == 2, "Medium",if_else(JobSatisfaction == 3, "High","Very High"))))
,PerformanceRating = as.factor(if_else(PerformanceRating == 1, "Low",if_else(PerformanceRating == 2, "Good", if_else(PerformanceRating == 3, "Excellent", "Outstanding"))))
,RelationshipSatisfaction = as.factor(if_else(RelationshipSatisfaction == 1, "Low",if_else(RelationshipSatisfaction == 2, "Medium", if_else(RelationshipSatisfaction == 3, "High", "Very High"))))
,WorkLifeBalance = as.factor(if_else(WorkLifeBalance == 1, "Bad",if_else(WorkLifeBalance == 2, "Good", if_else(WorkLifeBalance == 3, "Better", "Best"))))
,JobLevel = as.factor(JobLevel)
) %>%
select(-EmployeeCount, -EmployeeNumber, -Over18, -StandardHours, -StockOptionLevel, -JobLevel)
#### Convertir las variables de clase a factor
df$Attrition = factor ( df$Attrition)
df$BusinessTravel = factor ( df$BusinessTravel)
df$EducationField = factor ( df$EducationField)
df$Department = factor ( df$Department)
df$Gender = factor ( df$Gender)
df$JobRole = factor ( df$JobRole)
df$MaritalStatus = factor ( df$MaritalStatus)
df$OverTime = factor ( df$OverTime)
#### Primer resumen de datos
summary(df)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
## Median :36.00 Travel_Rarely :1043 Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 63 Min. : 1.000 Bachelor :572
## Research & Development:961 1st Qu.: 2.000 Below College:170
## Sales :446 Median : 7.000 College :282
## Mean : 9.193 Doctor : 48
## 3rd Qu.:14.000 Master :398
## Max. :29.000
##
## EducationField EnvironmentSatisfaction Gender HourlyRate
## Human Resources : 27 High :453 Female:588 Min. : 30.00
## Life Sciences :606 Low :284 Male :882 1st Qu.: 48.00
## Marketing :159 Medium :287 Median : 66.00
## Medical :464 Very High:446 Mean : 65.89
## Other : 82 3rd Qu.: 83.75
## Technical Degree:132 Max. :100.00
##
## JobInvolvement JobRole JobSatisfaction
## High :868 Sales Executive :326 High :442
## Low : 83 Research Scientist :292 Low :289
## Medium :375 Laboratory Technician :259 Medium :280
## Very High:144 Manufacturing Director :145 Very High:459
## Healthcare Representative:131
## Manager :102
## (Other) :215
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
## Divorced:327 Min. : 1009 Min. : 2094 Min. :0.000 No :1054
## Married :673 1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000 Yes: 416
## Single :470 Median : 4919 Median :14236 Median :2.000
## Mean : 6503 Mean :14313 Mean :2.693
## 3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000
## Max. :19999 Max. :26999 Max. :9.000
##
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## Min. :11.00 Excellent :1244 High :459
## 1st Qu.:12.00 Outstanding: 226 Low :276
## Median :14.00 Medium :303
## Mean :15.21 Very High:432
## 3rd Qu.:18.00
## Max. :25.00
##
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Min. : 0.00 Min. :0.000 Bad : 80 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 Best :153 1st Qu.: 3.000
## Median :10.00 Median :3.000 Better:893 Median : 5.000
## Mean :11.28 Mean :2.799 Good :344 Mean : 7.008
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :40.00 Max. :6.000 Max. :40.000
##
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 3.000 Median : 1.000 Median : 3.000
## Mean : 4.229 Mean : 2.188 Mean : 4.123
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :18.000 Max. :15.000 Max. :17.000
##
str(df)
## 'data.frame': 1470 obs. of 29 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : Factor w/ 5 levels "Bachelor","Below College",..: 3 2 3 5 2 3 1 2 1 1 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "High","Low","Medium",..: 3 1 4 4 2 4 1 4 4 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : Factor w/ 4 levels "High","Low","Medium",..: 1 3 3 1 1 1 4 1 3 1 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : Factor w/ 4 levels "High","Low","Medium",..: 4 3 1 1 3 4 2 1 1 1 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : Factor w/ 2 levels "Excellent","Outstanding": 1 2 1 1 1 1 2 2 2 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "High","Low","Medium",..: 2 4 3 1 4 1 2 3 3 3 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : Factor w/ 4 levels "Bad","Best","Better",..: 1 3 3 3 3 4 4 3 3 4 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
data.frame(colSums(is.na(df)))
## colSums.is.na.df..
## Age 0
## Attrition 0
## BusinessTravel 0
## DailyRate 0
## Department 0
## DistanceFromHome 0
## Education 0
## EducationField 0
## EnvironmentSatisfaction 0
## Gender 0
## HourlyRate 0
## JobInvolvement 0
## JobRole 0
## JobSatisfaction 0
## MaritalStatus 0
## MonthlyIncome 0
## MonthlyRate 0
## NumCompaniesWorked 0
## OverTime 0
## PercentSalaryHike 0
## PerformanceRating 0
## RelationshipSatisfaction 0
## TotalWorkingYears 0
## TrainingTimesLastYear 0
## WorkLifeBalance 0
## YearsAtCompany 0
## YearsInCurrentRole 0
## YearsSinceLastPromotion 0
## YearsWithCurrManager 0
out <- function(variable){
t(t(head(sort(variable,decreasing = T),20)))
}
lapply(df,function(x){
if(is.double(x)) out(x)
})
## $Age
## NULL
##
## $Attrition
## NULL
##
## $BusinessTravel
## NULL
##
## $DailyRate
## NULL
##
## $Department
## NULL
##
## $DistanceFromHome
## NULL
##
## $Education
## NULL
##
## $EducationField
## NULL
##
## $EnvironmentSatisfaction
## NULL
##
## $Gender
## NULL
##
## $HourlyRate
## NULL
##
## $JobInvolvement
## NULL
##
## $JobRole
## NULL
##
## $JobSatisfaction
## NULL
##
## $MaritalStatus
## NULL
##
## $MonthlyIncome
## NULL
##
## $MonthlyRate
## NULL
##
## $NumCompaniesWorked
## NULL
##
## $OverTime
## NULL
##
## $PercentSalaryHike
## NULL
##
## $PerformanceRating
## NULL
##
## $RelationshipSatisfaction
## NULL
##
## $TotalWorkingYears
## NULL
##
## $TrainingTimesLastYear
## NULL
##
## $WorkLifeBalance
## NULL
##
## $YearsAtCompany
## NULL
##
## $YearsInCurrentRole
## NULL
##
## $YearsSinceLastPromotion
## NULL
##
## $YearsWithCurrManager
## NULL
longi <- df %>%
summarise_all(mean) %>%
t() %>%
as.data.frame()
data.frame(variable = rownames(longi), media = longi$V1) %>%
arrange(desc(variable))
## variable media
## 1 YearsWithCurrManager 4.123129
## 2 YearsSinceLastPromotion 2.187755
## 3 YearsInCurrentRole 4.229252
## 4 YearsAtCompany 7.008163
## 5 WorkLifeBalance NA
## 6 TrainingTimesLastYear 2.799320
## 7 TotalWorkingYears 11.279592
## 8 RelationshipSatisfaction NA
## 9 PerformanceRating NA
## 10 PercentSalaryHike 15.209524
## 11 OverTime NA
## 12 NumCompaniesWorked 2.693197
## 13 MonthlyRate 14313.103401
## 14 MonthlyIncome 6502.931293
## 15 MaritalStatus NA
## 16 JobSatisfaction NA
## 17 JobRole NA
## 18 JobInvolvement NA
## 19 HourlyRate 65.891156
## 20 Gender NA
## 21 EnvironmentSatisfaction NA
## 22 EducationField NA
## 23 Education NA
## 24 DistanceFromHome 9.192517
## 25 Department NA
## 26 DailyRate 802.485714
## 27 BusinessTravel NA
## 28 Attrition NA
## 29 Age 36.923810
df %>%
select_if(is.integer) %>%
gather() %>%
ggplot(aes(value)) + geom_density() + facet_wrap(~key,scales='free') +
theme(axis.text=element_text(size=6))
df %>%
select_if(is.factor) %>%
gather() %>%
ggplot(aes(value)) + geom_bar() + facet_wrap(~key,scales='free') +
theme(axis.text=element_text(size=6))
#Hacemos análisis de correlaciones
cor <- df %>%
select_if(is.integer) %>%
cor() %>%
round(digits = 2)
corrgram(cor,order=TRUE,lower.panel=panel.shade,upper.panel=panel.pie)
corrplot(cor, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
#Codificación de las variables categóricas
dmy <- dummyVars(~., data = df[-2])
trsf <- data.frame(predict(dmy, newdata = df[-2]))
#Revisión de asimetría de los datos y tratamiento de los mismos (Skewness)
trsf <- trsf %>%
mutate(Age = log(Age + 1)
,DailyRate = log(DailyRate + 1)
,DistanceFromHome = log(DistanceFromHome + 1)
,HourlyRate = log(HourlyRate + 1)
,MonthlyIncome = log(MonthlyIncome + 1)
,MonthlyRate = log(MonthlyRate + 1)
,NumCompaniesWorked = log(NumCompaniesWorked + 1)
,PercentSalaryHike = log(PercentSalaryHike + 1)
,TotalWorkingYears = log(TotalWorkingYears + 1)
,TrainingTimesLastYear = log(TrainingTimesLastYear + 1)
,YearsAtCompany = log(YearsAtCompany +1)
,YearsInCurrentRole = log(YearsInCurrentRole + 1)
,YearsSinceLastPromotion = log(YearsSinceLastPromotion + 1)
,YearsWithCurrManager = log(YearsWithCurrManager + 1))
prep_num = preProcess(trsf, method=c("center", "scale"))
final_df = predict(prep_num, trsf)
# Mover las correlaciones más altas
cor_mat<- cor(final_df)
high_corr <- findCorrelation(cor_mat, cutoff = 0.85)
names(trsf)[high_corr]
## [1] "Department.Research...Development" "Department.Human.Resources"
## [3] "PerformanceRating.Outstanding" "Gender.Male"
## [5] "OverTime.Yes"
# Se establece el dataset final.
final_df <- cbind(trsf, df[2])
final_df <- final_df %>%
mutate(Attrition = if_else(Attrition == "Yes",1,0)) %>%
select(-Department.Research...Development,-Department.Human.Resources,-PerformanceRating.Outstanding,-Gender.Male,-OverTime.Yes)
# Para ver la proporción de los datos de mi variable a predecir.
table(final_df$Attrition)
##
## 0 1
## 1233 237
https://www.datacamp.com/community/tutorials/feature-selection-R-boruta
set.seed(1)
dfboruta <- Boruta(Attrition ~ ., data = final_df, doTrace = 2, ntree = 100)
print(dfboruta)
## Boruta performed 99 iterations in 11.37955 secs.
## 11 attributes confirmed important: Age,
## BusinessTravel.Travel_Frequently, JobInvolvement.Low,
## MaritalStatus.Single, MonthlyIncome and 6 more;
## 50 attributes confirmed unimportant: BusinessTravel.Non.Travel,
## BusinessTravel.Travel_Rarely, DailyRate, Department.Sales,
## DistanceFromHome and 45 more;
## 3 tentative attributes left: EnvironmentSatisfaction.Low,
## JobRole.Sales.Representative, WorkLifeBalance.Bad;
dfboruta2 <- TentativeRoughFix(dfboruta)
print(dfboruta2)
## Boruta performed 99 iterations in 11.37955 secs.
## Tentatives roughfixed over the last 99 iterations.
## 12 attributes confirmed important: Age,
## BusinessTravel.Travel_Frequently, JobInvolvement.Low,
## JobRole.Sales.Representative, MaritalStatus.Single and 7 more;
## 52 attributes confirmed unimportant: BusinessTravel.Non.Travel,
## BusinessTravel.Travel_Rarely, DailyRate, Department.Sales,
## DistanceFromHome and 47 more;
plot(dfboruta2, xlab = "", xaxt = "n")
lz<-lapply(1:ncol(dfboruta2$ImpHistory),function(i)
dfboruta2$ImpHistory[is.finite(dfboruta2$ImpHistory[,i]),i])
names(lz) <- colnames(dfboruta2$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(dfboruta2$ImpHistory), cex.axis = 0.7)
getSelectedAttributes(dfboruta2, withTentative = F)
## [1] "Age" "BusinessTravel.Travel_Frequently"
## [3] "JobInvolvement.Low" "JobRole.Sales.Representative"
## [5] "MaritalStatus.Single" "MonthlyIncome"
## [7] "NumCompaniesWorked" "OverTime.No"
## [9] "TotalWorkingYears" "YearsAtCompany"
## [11] "YearsInCurrentRole" "YearsWithCurrManager"
boruta3 <- attStats(dfboruta2)
print(boruta3)
## meanImp medianImp minImp
## Age 6.45037383 6.530997282 1.84622231
## BusinessTravel.Non.Travel 0.66175851 0.689340492 -1.65137238
## BusinessTravel.Travel_Frequently 2.64130944 2.701833435 -0.05560411
## BusinessTravel.Travel_Rarely 0.13162355 0.003317546 -1.75592670
## DailyRate 0.31154357 0.139747877 -2.03990911
## Department.Sales 0.71244019 0.637815646 -2.16985707
## DistanceFromHome 0.75770782 0.644062856 -1.67121743
## Education.Bachelor -0.25081264 -0.186592310 -1.64411056
## Education.Below.College 0.07047972 0.251581717 -1.54291262
## Education.College -0.85399705 -0.876644859 -1.61984938
## Education.Doctor 0.17890872 0.269538218 -1.43705903
## Education.Master 0.38311601 0.201395388 -2.18992597
## EducationField.Human.Resources -0.23663384 -0.199497220 -0.98849682
## EducationField.Life.Sciences -0.07838005 -0.363342721 -1.73405783
## EducationField.Marketing 1.10172919 0.934185018 -0.86706381
## EducationField.Medical 0.40665755 0.438838931 -1.85849159
## EducationField.Other -0.32291143 -0.725545945 -1.33021541
## EducationField.Technical.Degree 0.30467688 0.390316597 -1.57782682
## EnvironmentSatisfaction.High 0.39959524 0.853423427 -2.06174510
## EnvironmentSatisfaction.Low 2.54926587 2.570597075 -0.70243688
## EnvironmentSatisfaction.Medium -0.23622661 -0.538296025 -1.80623242
## EnvironmentSatisfaction.Very.High 0.23425753 0.082001356 -0.84509361
## Gender.Female 0.34505141 0.561749480 -2.38442074
## HourlyRate -0.08715462 -0.173416900 -1.36022476
## JobInvolvement.High 0.29150721 0.148906252 -1.34310413
## JobInvolvement.Low 2.76977909 2.843587540 -0.59972271
## JobInvolvement.Medium 0.16016674 0.046437885 -1.86032844
## JobInvolvement.Very.High 0.44058967 0.288121572 -0.93015133
## JobRole.Healthcare.Representative 0.12018904 0.176806488 -1.07207094
## JobRole.Human.Resources 0.06862365 0.229307584 -2.26747727
## JobRole.Laboratory.Technician 1.24686688 1.148519545 -1.27316406
## JobRole.Manager 0.47881201 0.446091151 -0.58402726
## JobRole.Manufacturing.Director 0.02995089 -0.308023953 -1.63986074
## JobRole.Research.Director 0.33274284 0.397837646 -1.11104768
## JobRole.Research.Scientist 0.82171704 0.777772881 -1.24504821
## JobRole.Sales.Executive 0.30630215 0.406670676 -1.26305038
## JobRole.Sales.Representative 2.23229485 2.208716340 -0.37922490
## JobSatisfaction.High -0.21486263 -0.214842846 -1.81975481
## JobSatisfaction.Low 0.71915220 1.093089606 -1.43247810
## JobSatisfaction.Medium -0.13231433 -0.012759739 -2.51176460
## JobSatisfaction.Very.High 1.32078729 1.397888103 -0.62277519
## MaritalStatus.Divorced 1.17517862 1.323703831 -0.43073392
## MaritalStatus.Married 0.68131273 0.772785665 -1.41641949
## MaritalStatus.Single 3.73288312 3.782982007 1.35300903
## MonthlyIncome 6.77735496 6.735880038 4.27654036
## MonthlyRate 0.42109402 0.284385716 -0.75403603
## NumCompaniesWorked 2.58440689 2.700131544 -1.01137499
## OverTime.No 11.22438420 11.418767140 7.16549830
## PercentSalaryHike 0.23137675 0.183057917 -1.64714211
## PerformanceRating.Excellent 0.17690256 0.151734821 -1.59279471
## RelationshipSatisfaction.High 0.44926549 0.334475196 -1.13472719
## RelationshipSatisfaction.Low 0.90230771 0.720613299 -0.99238267
## RelationshipSatisfaction.Medium -0.38284937 -0.570205696 -2.25260177
## RelationshipSatisfaction.Very.High -0.11290756 0.070548747 -2.27802251
## TotalWorkingYears 6.24915575 6.332090381 3.03471452
## TrainingTimesLastYear 0.13495252 0.109381597 -2.02196369
## WorkLifeBalance.Bad 1.49437690 1.599646164 -1.20697132
## WorkLifeBalance.Best -0.22480104 -0.336018421 -1.51770288
## WorkLifeBalance.Better 1.02708622 1.081089971 -1.23825949
## WorkLifeBalance.Good 0.73280866 0.656039742 -1.69121467
## YearsAtCompany 4.66803502 4.759493182 1.80506490
## YearsInCurrentRole 3.11357831 3.260915699 -0.01330695
## YearsSinceLastPromotion 0.82508820 0.782770163 -1.41339737
## YearsWithCurrManager 3.81419935 3.747453957 1.44246269
## maxImp normHits decision
## Age 10.9580559 0.98989899 Confirmed
## BusinessTravel.Non.Travel 2.2357841 0.00000000 Rejected
## BusinessTravel.Travel_Frequently 5.4780572 0.69696970 Confirmed
## BusinessTravel.Travel_Rarely 2.3066057 0.00000000 Rejected
## DailyRate 2.7387658 0.03030303 Rejected
## Department.Sales 2.3629386 0.00000000 Rejected
## DistanceFromHome 2.7829193 0.05050505 Rejected
## Education.Bachelor 1.2833262 0.00000000 Rejected
## Education.Below.College 1.2966277 0.00000000 Rejected
## Education.College 0.1599436 0.00000000 Rejected
## Education.Doctor 1.4792676 0.00000000 Rejected
## Education.Master 1.7291497 0.00000000 Rejected
## EducationField.Human.Resources 1.0409924 0.00000000 Rejected
## EducationField.Life.Sciences 2.0717543 0.01010101 Rejected
## EducationField.Marketing 3.4414443 0.06060606 Rejected
## EducationField.Medical 2.3976384 0.01010101 Rejected
## EducationField.Other 1.6734537 0.00000000 Rejected
## EducationField.Technical.Degree 1.9875480 0.01010101 Rejected
## EnvironmentSatisfaction.High 2.1655299 0.01010101 Rejected
## EnvironmentSatisfaction.Low 6.3113315 0.66666667 Rejected
## EnvironmentSatisfaction.Medium 2.0659808 0.00000000 Rejected
## EnvironmentSatisfaction.Very.High 2.2082436 0.01010101 Rejected
## Gender.Female 2.4979122 0.00000000 Rejected
## HourlyRate 1.7804527 0.00000000 Rejected
## JobInvolvement.High 1.9865370 0.00000000 Rejected
## JobInvolvement.Low 5.5357794 0.75757576 Confirmed
## JobInvolvement.Medium 1.8879255 0.00000000 Rejected
## JobInvolvement.Very.High 2.3413774 0.01010101 Rejected
## JobRole.Healthcare.Representative 1.5884720 0.00000000 Rejected
## JobRole.Human.Resources 1.3174662 0.00000000 Rejected
## JobRole.Laboratory.Technician 3.3359541 0.29292929 Rejected
## JobRole.Manager 1.9342300 0.00000000 Rejected
## JobRole.Manufacturing.Director 2.5301214 0.01010101 Rejected
## JobRole.Research.Director 2.0561523 0.01010101 Rejected
## JobRole.Research.Scientist 3.0490228 0.02020202 Rejected
## JobRole.Sales.Executive 1.9648835 0.00000000 Rejected
## JobRole.Sales.Representative 5.0535741 0.59595960 Confirmed
## JobSatisfaction.High 1.0224796 0.00000000 Rejected
## JobSatisfaction.Low 1.6927892 0.00000000 Rejected
## JobSatisfaction.Medium 1.6460921 0.00000000 Rejected
## JobSatisfaction.Very.High 2.9631515 0.02020202 Rejected
## MaritalStatus.Divorced 2.6240354 0.03030303 Rejected
## MaritalStatus.Married 3.5704549 0.02020202 Rejected
## MaritalStatus.Single 5.7657113 0.93939394 Confirmed
## MonthlyIncome 9.5610155 1.00000000 Confirmed
## MonthlyRate 1.8250827 0.00000000 Rejected
## NumCompaniesWorked 5.2414963 0.68686869 Confirmed
## OverTime.No 15.8521886 1.00000000 Confirmed
## PercentSalaryHike 3.1324619 0.02020202 Rejected
## PerformanceRating.Excellent 1.8313203 0.00000000 Rejected
## RelationshipSatisfaction.High 2.4846029 0.00000000 Rejected
## RelationshipSatisfaction.Low 3.4082527 0.07070707 Rejected
## RelationshipSatisfaction.Medium 2.2937267 0.01010101 Rejected
## RelationshipSatisfaction.Very.High 1.6397116 0.00000000 Rejected
## TotalWorkingYears 8.4562704 1.00000000 Confirmed
## TrainingTimesLastYear 2.1098181 0.01010101 Rejected
## WorkLifeBalance.Bad 4.1949965 0.32323232 Rejected
## WorkLifeBalance.Best 1.2113968 0.00000000 Rejected
## WorkLifeBalance.Better 3.3183128 0.04040404 Rejected
## WorkLifeBalance.Good 3.0005942 0.01010101 Rejected
## YearsAtCompany 7.6981916 0.96969697 Confirmed
## YearsInCurrentRole 6.5876411 0.80808081 Confirmed
## YearsSinceLastPromotion 2.5591235 0.02020202 Rejected
## YearsWithCurrManager 6.6005233 0.92929293 Confirmed
final_df <- final_df %>%
select(JobRole.Sales.Representative, NumCompaniesWorked, BusinessTravel.Travel_Frequently, JobInvolvement.Low, YearsInCurrentRole, YearsWithCurrManager, MaritalStatus.Single, YearsAtCompany, TotalWorkingYears, Age, MonthlyIncome, OverTime.No, Attrition)
final_df$Attrition = factor ( final_df$Attrition)
table(final_df$Attrition)
##
## 0 1
## 1233 237
#Selección de dataset de entrenamiento y testeo
set.seed(1)
index <- createDataPartition(final_df[,1], p=0.75, list = FALSE)
train <- final_df[index,]
test <- final_df[-index,]
str(train)
## 'data.frame': 1103 obs. of 13 variables:
## $ JobRole.Sales.Representative : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NumCompaniesWorked : num 2.197 1.946 2.303 1.609 0.693 ...
## $ BusinessTravel.Travel_Frequently: num 0 0 0 0 0 1 0 0 0 0 ...
## $ JobInvolvement.Low : num 0 0 0 0 0 0 0 0 0 0 ...
## $ YearsInCurrentRole : num 1.61 0 1.1 0 0 ...
## $ YearsWithCurrManager : num 1.79 0 1.1 0 0 ...
## $ MaritalStatus.Single : num 1 1 0 0 0 1 0 0 0 1 ...
## $ YearsAtCompany : num 1.946 0 1.099 0.693 0.693 ...
## $ TotalWorkingYears : num 2.197 2.079 1.946 2.565 0.693 ...
## $ Age : num 3.74 3.64 3.33 4.09 3.43 ...
## $ MonthlyIncome : num 8.7 7.65 8.15 7.89 7.9 ...
## $ OverTime.No : num 0 0 1 0 1 1 1 1 1 0 ...
## $ Attrition : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 2 ...
prop.table(table(train$Attrition))
##
## 0 1
## 0.8404352 0.1595648
prop.table(table(test$Attrition))
##
## 0 1
## 0.8337875 0.1662125
set.seed(1)
forest <- randomForest(Attrition ~ .,data=train,importance=TRUE)
#evaluar en conjunto de prueba--------------------------------------------
predictionsrf <- predict(forest, newdata = test)
#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionsrf,test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 298 51
## 1 8 10
##
## Accuracy : 0.8392
## 95% CI : (0.7976, 0.8753)
## No Information Rate : 0.8338
## P-Value [Acc > NIR] : 0.4226
##
## Kappa : 0.192
##
## Mcnemar's Test P-Value : 0.00000004553
##
## Sensitivity : 0.9739
## Specificity : 0.1639
## Pos Pred Value : 0.8539
## Neg Pred Value : 0.5556
## Prevalence : 0.8338
## Detection Rate : 0.8120
## Detection Prevalence : 0.9510
## Balanced Accuracy : 0.5689
##
## 'Positive' Class : 0
##
importance(forest)
## 0 1 MeanDecreaseAccuracy
## JobRole.Sales.Representative 6.951697 4.112147 8.484580
## NumCompaniesWorked 10.884138 3.030282 11.406819
## BusinessTravel.Travel_Frequently 8.192549 4.713082 9.442856
## JobInvolvement.Low 9.075979 6.186014 10.712588
## YearsInCurrentRole 12.667339 8.919113 16.894636
## YearsWithCurrManager 10.270908 3.285121 12.490629
## MaritalStatus.Single 6.428801 11.840699 11.376491
## YearsAtCompany 14.899006 2.497170 16.933074
## TotalWorkingYears 17.668647 1.815352 18.491793
## Age 17.680327 7.886734 20.239531
## MonthlyIncome 14.950453 6.587204 18.089071
## OverTime.No 14.875886 22.511325 23.040566
## MeanDecreaseGini
## JobRole.Sales.Representative 5.327672
## NumCompaniesWorked 25.096104
## BusinessTravel.Travel_Frequently 10.860923
## JobInvolvement.Low 7.347062
## YearsInCurrentRole 18.673823
## YearsWithCurrManager 20.081600
## MaritalStatus.Single 12.632842
## YearsAtCompany 26.679911
## TotalWorkingYears 35.302102
## Age 40.490647
## MonthlyIncome 54.851854
## OverTime.No 19.300534
set.seed(1)
ROCRpred <- prediction(as.numeric(predictionsrf), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_rf <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_rf, digits=5, scientific=FALSE)))
set.seed(1)
arbol <- rpart(Attrition ~ ., data=train,method = "class") ### En Method se tiene que especificar, porque por defecto deja Gini ###
plot(arbol,uniform=T,margin=0.2)
text (arbol, use.n = T, pretty = TRUE)
title("Training Set's Classification Tree")
#evaluar en conjunto de prueba---------------------------------------------
predictionsar <- predict(arbol, test, type="class")
#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionsar,test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 299 50
## 1 7 11
##
## Accuracy : 0.8447
## 95% CI : (0.8035, 0.8802)
## No Information Rate : 0.8338
## P-Value [Acc > NIR] : 0.3161
##
## Kappa : 0.2194
##
## Mcnemar's Test P-Value : 0.00000002651
##
## Sensitivity : 0.9771
## Specificity : 0.1803
## Pos Pred Value : 0.8567
## Neg Pred Value : 0.6111
## Prevalence : 0.8338
## Detection Rate : 0.8147
## Detection Prevalence : 0.9510
## Balanced Accuracy : 0.5787
##
## 'Positive' Class : 0
##
set.seed(1)
plotcp(arbol)
ROCRpred <- prediction(as.numeric(predictionsar), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_dt <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_dt, digits=5, scientific=FALSE)))
set.seed(1)
svm <- ksvm(Attrition ~ ., data = train, kernel = "vanilladot", cross=10)
## Setting default kernel parameters
#evaluar en conjunto de prueba---------------------------------------------
predictionssvm <- predict(svm, test)
#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionssvm,test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 306 61
## 1 0 0
##
## Accuracy : 0.8338
## 95% CI : (0.7917, 0.8704)
## No Information Rate : 0.8338
## P-Value [Acc > NIR] : 0.5341
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.00000000000001564
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8338
## Neg Pred Value : NaN
## Prevalence : 0.8338
## Detection Rate : 0.8338
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
set.seed(1)
ROCRpred <- prediction(as.numeric(predictionssvm), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_rf <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_rf, digits=5, scientific=FALSE)))
set.seed(1)
control <- trainControl(method="repeatedcv", number=5)
xgb <- train(Attrition~., data=train, method="xgbTree", trControl=control)
#evaluar en conjunto de prueba---------------------------------------------
predictionsxgb <- predict(xgb, test)
#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionsxgb,test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 296 45
## 1 10 16
##
## Accuracy : 0.8501
## 95% CI : (0.8094, 0.8851)
## No Information Rate : 0.8338
## P-Value [Acc > NIR] : 0.2222
##
## Kappa : 0.2981
##
## Mcnemar's Test P-Value : 0.000004549
##
## Sensitivity : 0.9673
## Specificity : 0.2623
## Pos Pred Value : 0.8680
## Neg Pred Value : 0.6154
## Prevalence : 0.8338
## Detection Rate : 0.8065
## Detection Prevalence : 0.9292
## Balanced Accuracy : 0.6148
##
## 'Positive' Class : 0
##
set.seed(1)
ROCRpred <- prediction(as.numeric(predictionsxgb), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_xgb <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_xgb, digits=5, scientific=FALSE)))
set.seed(1)
control <- trainControl(method="repeatedcv", number=5)
lr <- train(Attrition~., data=train, method="glm", trControl=control)
#evaluar en conjunto de prueba---------------------------------------------
predictionslr <- predict(lr, test)
#matriz de confusion-------------------------------------------------------
confusionMatrix(predictionslr,test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 301 48
## 1 5 13
##
## Accuracy : 0.8556
## 95% CI : (0.8154, 0.8899)
## No Information Rate : 0.8338
## P-Value [Acc > NIR] : 0.146
##
## Kappa : 0.2741
##
## Mcnemar's Test P-Value : 0.000000007968
##
## Sensitivity : 0.9837
## Specificity : 0.2131
## Pos Pred Value : 0.8625
## Neg Pred Value : 0.7222
## Prevalence : 0.8338
## Detection Rate : 0.8202
## Detection Prevalence : 0.9510
## Balanced Accuracy : 0.5984
##
## 'Positive' Class : 0
##
set.seed(1)
ROCRpred <- prediction(as.numeric(predictionslr), as.numeric(test$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_lr <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_lr, digits=5, scientific=FALSE)))
# Tabulating accuracies
Model <- c('Random Forest','Decision tree','SVM-vanilladot', 'XGB', 'Logistic Regresion')
Accuracy <- c(83.92,84.47,83.38,85.01,85.56)
AUC <- c(56.89,57.87,50,61.48,59.83)
mytable<- data.frame(Model, Accuracy, AUC)
head(mytable)
## Model Accuracy AUC
## 1 Random Forest 83.92 56.89
## 2 Decision tree 84.47 57.87
## 3 SVM-vanilladot 83.38 50.00
## 4 XGB 85.01 61.48
## 5 Logistic Regresion 85.56 59.83
set.seed(1)
implementa <- fread('implementación2020.csv') ### cargar el nuevo dataset
#Editamos el dataset con el objetivo de categorizar bien las variables.
implementa <- implementa %>%
mutate(Education = as.factor(if_else(Education == 1,"Below College", if_else(Education == 2, "College", if_else(Education == 3, "Bachelor", if_else(Education == 4, "Master","Doctor")))))
,EnvironmentSatisfaction = as.factor(if_else(EnvironmentSatisfaction == 1,"Low",if_else(EnvironmentSatisfaction == 2, "Medium", if_else(EnvironmentSatisfaction == 3, "High", "Very High"))))
,JobInvolvement = as.factor(if_else(JobInvolvement == 1,"Low",if_else(JobInvolvement == 2, "Medium",if_else(JobInvolvement == 3, "High", "Very High"))))
,JobSatisfaction = as.factor(if_else(JobSatisfaction == 1, "Low",if_else(JobSatisfaction == 2, "Medium",if_else(JobSatisfaction == 3, "High","Very High"))))
,PerformanceRating = as.factor(if_else(PerformanceRating == 1, "Low",if_else(PerformanceRating == 2, "Good", if_else(PerformanceRating == 3, "Excellent", "Outstanding"))))
,RelationshipSatisfaction = as.factor(if_else(RelationshipSatisfaction == 1, "Low",if_else(RelationshipSatisfaction == 2, "Medium", if_else(RelationshipSatisfaction == 3, "High", "Very High"))))
,WorkLifeBalance = as.factor(if_else(WorkLifeBalance == 1, "Bad",if_else(WorkLifeBalance == 2, "Good", if_else(WorkLifeBalance == 3, "Better", "Best"))))
,JobLevel = as.factor(JobLevel)
) %>%
select(-EmployeeCount, -EmployeeNumber, -Over18, -StandardHours, -StockOptionLevel, -JobLevel)
#### Convertir las variables de clase a factor
implementa$Attrition = factor ( implementa$Attrition)
implementa$BusinessTravel = factor ( implementa$BusinessTravel)
implementa$EducationField = factor ( implementa$EducationField)
implementa$Department = factor ( implementa$Department)
implementa$Gender = factor ( implementa$Gender)
implementa$JobRole = factor ( implementa$JobRole)
implementa$MaritalStatus = factor ( implementa$MaritalStatus)
implementa$OverTime = factor ( implementa$OverTime)
#### Primer resumen de datos
summary(df)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
## Median :36.00 Travel_Rarely :1043 Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 63 Min. : 1.000 Bachelor :572
## Research & Development:961 1st Qu.: 2.000 Below College:170
## Sales :446 Median : 7.000 College :282
## Mean : 9.193 Doctor : 48
## 3rd Qu.:14.000 Master :398
## Max. :29.000
##
## EducationField EnvironmentSatisfaction Gender HourlyRate
## Human Resources : 27 High :453 Female:588 Min. : 30.00
## Life Sciences :606 Low :284 Male :882 1st Qu.: 48.00
## Marketing :159 Medium :287 Median : 66.00
## Medical :464 Very High:446 Mean : 65.89
## Other : 82 3rd Qu.: 83.75
## Technical Degree:132 Max. :100.00
##
## JobInvolvement JobRole JobSatisfaction
## High :868 Sales Executive :326 High :442
## Low : 83 Research Scientist :292 Low :289
## Medium :375 Laboratory Technician :259 Medium :280
## Very High:144 Manufacturing Director :145 Very High:459
## Healthcare Representative:131
## Manager :102
## (Other) :215
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
## Divorced:327 Min. : 1009 Min. : 2094 Min. :0.000 No :1054
## Married :673 1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000 Yes: 416
## Single :470 Median : 4919 Median :14236 Median :2.000
## Mean : 6503 Mean :14313 Mean :2.693
## 3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000
## Max. :19999 Max. :26999 Max. :9.000
##
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## Min. :11.00 Excellent :1244 High :459
## 1st Qu.:12.00 Outstanding: 226 Low :276
## Median :14.00 Medium :303
## Mean :15.21 Very High:432
## 3rd Qu.:18.00
## Max. :25.00
##
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Min. : 0.00 Min. :0.000 Bad : 80 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 Best :153 1st Qu.: 3.000
## Median :10.00 Median :3.000 Better:893 Median : 5.000
## Mean :11.28 Mean :2.799 Good :344 Mean : 7.008
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :40.00 Max. :6.000 Max. :40.000
##
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 3.000 Median : 1.000 Median : 3.000
## Mean : 4.229 Mean : 2.188 Mean : 4.123
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :18.000 Max. :15.000 Max. :17.000
##
str(df)
## 'data.frame': 1470 obs. of 29 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : Factor w/ 5 levels "Bachelor","Below College",..: 3 2 3 5 2 3 1 2 1 1 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "High","Low","Medium",..: 3 1 4 4 2 4 1 4 4 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : Factor w/ 4 levels "High","Low","Medium",..: 1 3 3 1 1 1 4 1 3 1 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : Factor w/ 4 levels "High","Low","Medium",..: 4 3 1 1 3 4 2 1 1 1 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : Factor w/ 2 levels "Excellent","Outstanding": 1 2 1 1 1 1 2 2 2 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "High","Low","Medium",..: 2 4 3 1 4 1 2 3 3 3 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : Factor w/ 4 levels "Bad","Best","Better",..: 1 3 3 3 3 4 4 3 3 4 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
#Codificación de variables categóricas
dmy2 <- dummyVars(~., data = implementa[-2])
trsf2 <- data.frame(predict(dmy2, newdata = implementa[-2]))
#Removing Skewness
trsf2 <- trsf2%>%
mutate(Age = log(Age + 1)
,DailyRate = log(DailyRate + 1)
,DistanceFromHome = log(DistanceFromHome + 1)
,HourlyRate = log(HourlyRate + 1)
,MonthlyIncome = log(MonthlyIncome + 1)
,MonthlyRate = log(MonthlyRate + 1)
,NumCompaniesWorked = log(NumCompaniesWorked + 1)
,PercentSalaryHike = log(PercentSalaryHike + 1)
,TotalWorkingYears = log(TotalWorkingYears + 1)
,TrainingTimesLastYear = log(TrainingTimesLastYear + 1)
,YearsAtCompany = log(YearsAtCompany +1)
,YearsInCurrentRole = log(YearsInCurrentRole + 1)
,YearsSinceLastPromotion = log(YearsSinceLastPromotion + 1)
,YearsWithCurrManager = log(YearsWithCurrManager + 1))
prep_num2 = preProcess(trsf2, method=c("center", "scale"))
final_implementa = predict(prep_num2, trsf2)
# Mover las correlaciones más altas
cor_mat<- cor(final_implementa)
high_corr <- findCorrelation(cor_mat, cutoff = 0.85)
names(trsf)[high_corr]
## [1] "WorkLifeBalance.Bad" "WorkLifeBalance.Best"
## [3] "JobSatisfaction.Low" "NumCompaniesWorked"
## [5] "BusinessTravel.Non.Travel" "DailyRate"
## [7] "JobInvolvement.Medium" "JobSatisfaction.High"
## [9] "EnvironmentSatisfaction.Very.High" "Department.Human.Resources"
## [11] "MaritalStatus.Single"
# Se establece el dataset final.
final_implementa <- cbind(trsf, df[2])
final_implementa <- final_implementa %>%
mutate(Attrition = if_else(Attrition == "Yes",1,0)) %>%
select(-Department.Research...Development,-Department.Human.Resources,-PerformanceRating.Outstanding,-Gender.Male,-OverTime.Yes)
# Para ver la proporción de los datos de mi variable a predecir.
table(final_df$Attrition)
##
## 0 1
## 1233 237
# Selección de variables
final_implementa <- final_implementa %>%
select(JobRole.Sales.Representative, NumCompaniesWorked, BusinessTravel.Travel_Frequently, JobInvolvement.Low, YearsInCurrentRole, YearsWithCurrManager, MaritalStatus.Single, YearsAtCompany, TotalWorkingYears, Age, MonthlyIncome, OverTime.No, Attrition)
final_implementa$Attrition = factor ( final_implementa$Attrition)
#Predecir con 30 datos aleatoreos.Todos verdaderos positivos, es decir, que dejaron la compañía.
#Vamos a utilizar el modelo Random Forest para ver el poder predictivo del mismo.
set.seed(1)
final_implementa$Predicted<-predict(lr,final_implementa)
resultado <- subset(final_implementa, Predicted == 1)
head(resultado)
## JobRole.Sales.Representative NumCompaniesWorked
## 3 0 1.9459101
## 15 0 1.7917595
## 27 0 0.6931472
## 43 0 0.6931472
## 103 0 0.6931472
## 112 0 0.6931472
## BusinessTravel.Travel_Frequently JobInvolvement.Low YearsInCurrentRole
## 3 0 0 0.000000
## 15 0 0 1.098612
## 27 1 1 1.098612
## 43 0 1 0.000000
## 103 1 0 0.000000
## 112 1 1 2.079442
## YearsWithCurrManager MaritalStatus.Single YearsAtCompany TotalWorkingYears
## 3 0.0000000 1 0.0000000 2.0794415
## 15 1.3862944 1 1.6094379 1.9459101
## 27 2.0794415 1 2.3978953 2.3978953
## 43 0.6931472 1 0.6931472 0.6931472
## 103 0.0000000 1 0.6931472 0.6931472
## 112 1.9459101 1 2.3025851 2.3025851
## Age MonthlyIncome OverTime.No Attrition Predicted
## 3 3.637586 7.645398 0 1 1
## 15 3.367296 7.615298 0 1 1
## 27 3.496508 8.273847 0 1 1
## 43 3.295837 7.738052 1 1 1
## 103 3.044522 7.981733 0 1 1
## 112 3.555348 8.711937 0 1 1
Conclusión:
Se utilizó la regresión logístca, ya que tiene uno de los dos mejores Accuracy más su desempeño en los indicadores AUC- ROC. Al implementarlo con 30 casos aleatorios originales de la base..