Introduction

This analysis looks at the topic of binary classification and the topics of sensitivity and specificity.Topics from the previous chapter such as the creation of binomial models were also used in this chapter. Topic that were mentioned in chapter 8 were also implemented in this analysis. The data set used in this analysis was EmployeeSatisfaction.csv and looked at people who left the company.

library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
EmployeeSatisfaction <- read_csv("EmployeeSatisfaction.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_double(),
##   satisfaction_level = col_double(),
##   last_evaluation_score = col_double(),
##   number_of_projects = col_double(),
##   average_monthly_hours = col_double(),
##   years_spent_at_company = col_double(),
##   work_accident = col_double(),
##   left_company = col_double(),
##   promotion_in_last_5years = col_double(),
##   department = col_character(),
##   salary = col_character(),
##   salary_range = col_character()
## )
summary(EmployeeSatisfaction)
##   employee_id    satisfaction_level last_evaluation_score number_of_projects
##  Min.   :    1   Min.   :0.0900     Min.   :0.3600        Min.   :2.000     
##  1st Qu.: 3751   1st Qu.:0.4400     1st Qu.:0.5600        1st Qu.:3.000     
##  Median : 7500   Median :0.6400     Median :0.7200        Median :4.000     
##  Mean   : 7500   Mean   :0.6128     Mean   :0.7161        Mean   :3.803     
##  3rd Qu.:11250   3rd Qu.:0.8200     3rd Qu.:0.8700        3rd Qu.:5.000     
##  Max.   :15000   Max.   :1.0000     Max.   :1.0000        Max.   :7.000     
##  average_monthly_hours years_spent_at_company work_accident     left_company   
##  Min.   : 96           Min.   : 2.000         Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:156           1st Qu.: 3.000         1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :200           Median : 3.000         Median :0.0000   Median :0.0000  
##  Mean   :201           Mean   : 3.498         Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.:245           3rd Qu.: 4.000         3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :310           Max.   :10.000         Max.   :1.0000   Max.   :1.0000  
##  promotion_in_last_5years  department           salary         
##  Min.   :0.00000          Length:15000       Length:15000      
##  1st Qu.:0.00000          Class :character   Class :character  
##  Median :0.00000          Mode  :character   Mode  :character  
##  Mean   :0.02127                                               
##  3rd Qu.:0.00000                                               
##  Max.   :1.00000                                               
##  salary_range      
##  Length:15000      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
table(EmployeeSatisfaction$salary)
## 
##   high    low medium 
##   1237   7317   6446
table(EmployeeSatisfaction$salary_range)
## 
##    $45,000 - $74,999 Greater than $75,000    Less than $45,000 
##                 6446                 1237                 7317
table(EmployeeSatisfaction$left_company)
## 
##     0     1 
## 11428  3572
table(EmployeeSatisfaction$department)
## 
##  accounting          hr          IT  management   marketing product_mng 
##         767         739        1227         630         858         902 
##       RandD       sales     support   technical 
##         787        4140        2230        2720
EmployeeSatisfaction <-EmployeeSatisfaction[-c(12)]
XEmployeeSatisfaction<- model.matrix(left_company~.,data=EmployeeSatisfaction)[,-c(1, 2)]
XEmployeeSatisfaction[1:3,]
##   satisfaction_level last_evaluation_score number_of_projects
## 1               0.38                  0.53                  2
## 2               0.80                  0.86                  5
## 3               0.11                  0.88                  7
##   average_monthly_hours years_spent_at_company work_accident
## 1                   157                      3             0
## 2                   262                      6             0
## 3                   272                      4             0
##   promotion_in_last_5years departmenthr departmentIT departmentmanagement
## 1                        0            0            0                    0
## 2                        0            0            0                    0
## 3                        0            0            0                    0
##   departmentmarketing departmentproduct_mng departmentRandD departmentsales
## 1                   0                     0               0               1
## 2                   0                     0               0               1
## 3                   0                     0               0               1
##   departmentsupport departmenttechnical salarylow salarymedium
## 1                 0                   0         1            0
## 2                 0                   0         0            1
## 3                 0                   0         0            1
set.seed(1)
train <- sample(1:15000,11250)
xtrain <- XEmployeeSatisfaction[train,]
xnew <- XEmployeeSatisfaction[-train,]
ytrain <- EmployeeSatisfaction$left_company[train]
ynew <- EmployeeSatisfaction$left_company[-train]
leftglm=glm(left_company~.,family=binomial,data=data.frame(left_company=ytrain,xtrain))
summary(leftglm)
## 
## Call:
## glm(formula = left_company ~ ., family = binomial, data = data.frame(left_company = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1772  -0.6670  -0.4096  -0.1179   3.0774  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.2763798  0.2161009  -5.906 3.50e-09 ***
## satisfaction_level       -4.1041921  0.1121258 -36.603  < 2e-16 ***
## last_evaluation_score     0.7493810  0.1712010   4.377 1.20e-05 ***
## number_of_projects       -0.3057140  0.0243303 -12.565  < 2e-16 ***
## average_monthly_hours     0.0041024  0.0005915   6.936 4.04e-12 ***
## years_spent_at_company    0.2560390  0.0177992  14.385  < 2e-16 ***
## work_accident            -1.5171325  0.1025541 -14.793  < 2e-16 ***
## promotion_in_last_5years -1.4497472  0.3104691  -4.670 3.02e-06 ***
## departmenthr              0.1928059  0.1502022   1.284 0.199267    
## departmentIT             -0.1908957  0.1395694  -1.368 0.171391    
## departmentmanagement     -0.4216941  0.1813021  -2.326 0.020023 *  
## departmentmarketing      -0.0870040  0.1524171  -0.571 0.568116    
## departmentproduct_mng    -0.1961229  0.1481609  -1.324 0.185598    
## departmentRandD          -0.5373096  0.1629124  -3.298 0.000973 ***
## departmentsales          -0.0826802  0.1162672  -0.711 0.477009    
## departmentsupport        -0.0009960  0.1240977  -0.008 0.993596    
## departmenttechnical       0.0672144  0.1211612   0.555 0.579065    
## salarylow                 1.8260563  0.1404802  12.999  < 2e-16 ***
## salarymedium              1.3052384  0.1413487   9.234  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12399.3  on 11249  degrees of freedom
## Residual deviance:  9723.3  on 11231  degrees of freedom
## AIC: 9761.3
## 
## Number of Fisher Scoring iterations: 5
truepos <- ynew==1 & ptest>=cut 
trueneg <- ynew==0 & ptest<cut
sum(truepos)/sum(ynew==1)
## [1] 0.3692661
sum(trueneg)/sum(ynew==0) 
## [1] 0.9277276

Conclusion

COlumn 12 was removed from the data set since it gave N/A when running the prediction model. The first column was removed since ID number is irrelevant to the analysis. The probably cut off 1/2 was chosen since the likeliness of a false negative is that it either happens or it doesnโ€™t. This resulted in a sensitivity of 36.9% and specificity of 92.7%. The likeliness of an employee not quitting is 92.7% and the likeliness of an employee quitting is 36.9 %. I attempted to make an ROC curve, but I ran into a few issues when tying to make on of them. My initial plan was to include one. Overall, the binomial classification showed that at a specified place, employees are more likely to stay on rather than quit their job.