This analysis looks at the topic of binary classification and the topics of sensitivity and specificity.Topics from the previous chapter such as the creation of binomial models were also used in this chapter. Topic that were mentioned in chapter 8 were also implemented in this analysis. The data set used in this analysis was EmployeeSatisfaction.csv and looked at people who left the company.
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
EmployeeSatisfaction <- read_csv("EmployeeSatisfaction.csv")
## Parsed with column specification:
## cols(
## employee_id = col_double(),
## satisfaction_level = col_double(),
## last_evaluation_score = col_double(),
## number_of_projects = col_double(),
## average_monthly_hours = col_double(),
## years_spent_at_company = col_double(),
## work_accident = col_double(),
## left_company = col_double(),
## promotion_in_last_5years = col_double(),
## department = col_character(),
## salary = col_character(),
## salary_range = col_character()
## )
summary(EmployeeSatisfaction)
## employee_id satisfaction_level last_evaluation_score number_of_projects
## Min. : 1 Min. :0.0900 Min. :0.3600 Min. :2.000
## 1st Qu.: 3751 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000
## Median : 7500 Median :0.6400 Median :0.7200 Median :4.000
## Mean : 7500 Mean :0.6128 Mean :0.7161 Mean :3.803
## 3rd Qu.:11250 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000
## Max. :15000 Max. :1.0000 Max. :1.0000 Max. :7.000
## average_monthly_hours years_spent_at_company work_accident left_company
## Min. : 96 Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:156 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :200 Median : 3.000 Median :0.0000 Median :0.0000
## Mean :201 Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.:245 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :310 Max. :10.000 Max. :1.0000 Max. :1.0000
## promotion_in_last_5years department salary
## Min. :0.00000 Length:15000 Length:15000
## 1st Qu.:0.00000 Class :character Class :character
## Median :0.00000 Mode :character Mode :character
## Mean :0.02127
## 3rd Qu.:0.00000
## Max. :1.00000
## salary_range
## Length:15000
## Class :character
## Mode :character
##
##
##
table(EmployeeSatisfaction$salary)
##
## high low medium
## 1237 7317 6446
table(EmployeeSatisfaction$salary_range)
##
## $45,000 - $74,999 Greater than $75,000 Less than $45,000
## 6446 1237 7317
table(EmployeeSatisfaction$left_company)
##
## 0 1
## 11428 3572
table(EmployeeSatisfaction$department)
##
## accounting hr IT management marketing product_mng
## 767 739 1227 630 858 902
## RandD sales support technical
## 787 4140 2230 2720
EmployeeSatisfaction <-EmployeeSatisfaction[-c(12)]
XEmployeeSatisfaction<- model.matrix(left_company~.,data=EmployeeSatisfaction)[,-c(1, 2)]
XEmployeeSatisfaction[1:3,]
## satisfaction_level last_evaluation_score number_of_projects
## 1 0.38 0.53 2
## 2 0.80 0.86 5
## 3 0.11 0.88 7
## average_monthly_hours years_spent_at_company work_accident
## 1 157 3 0
## 2 262 6 0
## 3 272 4 0
## promotion_in_last_5years departmenthr departmentIT departmentmanagement
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## departmentmarketing departmentproduct_mng departmentRandD departmentsales
## 1 0 0 0 1
## 2 0 0 0 1
## 3 0 0 0 1
## departmentsupport departmenttechnical salarylow salarymedium
## 1 0 0 1 0
## 2 0 0 0 1
## 3 0 0 0 1
set.seed(1)
train <- sample(1:15000,11250)
xtrain <- XEmployeeSatisfaction[train,]
xnew <- XEmployeeSatisfaction[-train,]
ytrain <- EmployeeSatisfaction$left_company[train]
ynew <- EmployeeSatisfaction$left_company[-train]
leftglm=glm(left_company~.,family=binomial,data=data.frame(left_company=ytrain,xtrain))
summary(leftglm)
##
## Call:
## glm(formula = left_company ~ ., family = binomial, data = data.frame(left_company = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1772 -0.6670 -0.4096 -0.1179 3.0774
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2763798 0.2161009 -5.906 3.50e-09 ***
## satisfaction_level -4.1041921 0.1121258 -36.603 < 2e-16 ***
## last_evaluation_score 0.7493810 0.1712010 4.377 1.20e-05 ***
## number_of_projects -0.3057140 0.0243303 -12.565 < 2e-16 ***
## average_monthly_hours 0.0041024 0.0005915 6.936 4.04e-12 ***
## years_spent_at_company 0.2560390 0.0177992 14.385 < 2e-16 ***
## work_accident -1.5171325 0.1025541 -14.793 < 2e-16 ***
## promotion_in_last_5years -1.4497472 0.3104691 -4.670 3.02e-06 ***
## departmenthr 0.1928059 0.1502022 1.284 0.199267
## departmentIT -0.1908957 0.1395694 -1.368 0.171391
## departmentmanagement -0.4216941 0.1813021 -2.326 0.020023 *
## departmentmarketing -0.0870040 0.1524171 -0.571 0.568116
## departmentproduct_mng -0.1961229 0.1481609 -1.324 0.185598
## departmentRandD -0.5373096 0.1629124 -3.298 0.000973 ***
## departmentsales -0.0826802 0.1162672 -0.711 0.477009
## departmentsupport -0.0009960 0.1240977 -0.008 0.993596
## departmenttechnical 0.0672144 0.1211612 0.555 0.579065
## salarylow 1.8260563 0.1404802 12.999 < 2e-16 ***
## salarymedium 1.3052384 0.1413487 9.234 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12399.3 on 11249 degrees of freedom
## Residual deviance: 9723.3 on 11231 degrees of freedom
## AIC: 9761.3
##
## Number of Fisher Scoring iterations: 5
truepos <- ynew==1 & ptest>=cut
trueneg <- ynew==0 & ptest<cut
sum(truepos)/sum(ynew==1)
## [1] 0.3692661
sum(trueneg)/sum(ynew==0)
## [1] 0.9277276
COlumn 12 was removed from the data set since it gave N/A when running the prediction model. The first column was removed since ID number is irrelevant to the analysis. The probably cut off 1/2 was chosen since the likeliness of a false negative is that it either happens or it doesnโt. This resulted in a sensitivity of 36.9% and specificity of 92.7%. The likeliness of an employee not quitting is 92.7% and the likeliness of an employee quitting is 36.9 %. I attempted to make an ROC curve, but I ran into a few issues when tying to make on of them. My initial plan was to include one. Overall, the binomial classification showed that at a specified place, employees are more likely to stay on rather than quit their job.