This vignette is an introduction to applying logistic regression using the glm function in R. We will be using the Human Resources Analytics data set, which can be acquired from Kaggle.
First off, we need to load the necessary packages and the data we will be using for this analysis.
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)
library(pROC)
library(caret)
data <- read_csv("C:/Users/Jared Chung/Desktop/DataScienceNotes/uts_masters_files/HR_analytics/HR_comma_sep.csv")
Before we can start modelling we first need an idea of what the data looks like, which can be achieved using the head function.
head(data)
We can see from here that there is a combination of categorical (character), numeric and integer data types.
The str function is another way to get a breakdown of the data set.
str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 14999 obs. of 10 variables:
$ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
$ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
$ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
$ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
$ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
$ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
$ left : int 1 1 1 1 1 1 1 1 1 1 ...
$ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
$ sales : chr "sales" "sales" "sales" "sales" ...
$ salary : chr "low" "medium" "medium" "low" ...
- attr(*, "spec")=List of 2
..$ cols :List of 10
.. ..$ satisfaction_level : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ last_evaluation : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ number_project : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ average_montly_hours : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ time_spend_company : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Work_accident : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ left : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ promotion_last_5years: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ sales : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ salary : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
As much of the data set is numeric or integer, we can look at the descriptive statistics of each variable. This produces some useful information: for example, the average monthly hours per worker is 201 and the longest anyone has spent at this company is 10 years.
summary(data)
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left
Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0 Min. : 2.000 Min. :0.0000 Min. :0.0000
1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.6400 Median :0.7200 Median :4.000 Median :200.0 Median : 3.000 Median :0.0000 Median :0.0000
Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1 Mean : 3.498 Mean :0.1446 Mean :0.2381
3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0 Max. :10.000 Max. :1.0000 Max. :1.0000
promotion_last_5years sales salary
Min. :0.00000 Length:14999 Length:14999
1st Qu.:0.00000 Class :character Class :character
Median :0.00000 Mode :character Mode :character
Mean :0.02127
3rd Qu.:0.00000
Max. :1.00000
Although descriptive statistics are a good way to understand each variable, it is also good practice to plot them. Using the ggpairs function from the GGally package, we can quickly produce a mixed scatter-plot matrix that gives an enhanced view of the data set.
# progress = FALSE suppresses the plot-building progress bar
a <- ggpairs(data, progress = FALSE)
print(a)
The correlation statistics let us see whether some of the variables carry similar information. When modelling, it is generally important to avoid predictors which are highly correlated with each other, as this multicollinearity can impair the model's ability to estimate each variable's effect. The plot shows that there isn't any major correlation between the variables.
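If we want those correlations as numbers rather than reading them off the plot, a quick sketch using only base R is to subset the numeric columns and call cor (num_cols here is just an illustrative variable name):
# Pairwise correlations between the numeric columns only
num_cols <- sapply(data, is.numeric)
round(cor(data[, num_cols]), 2)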
After going through these initial steps we can finally start modelling. We will use the binomial family because we are trying to predict a binary outcome: whether an employee will leave the company.
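The summary below was produced from a model fitted on a training partition, but the split itself is not shown. Here is a minimal sketch of that missing step, assuming a 70/30 split made with caret's createDataPartition (the seed and exact proportion used originally are unknown, so these values are illustrative):
# Assumed 70/30 train/test split; seed 123 is illustrative, not the original
set.seed(123)
train_index <- as.vector(createDataPartition(data$left, p = 0.7, list = FALSE))
train <- data[train_index, ]
test <- data[-train_index, ]

# Fit a logistic regression on all predictors, matching the Call shown below
model <- glm(left ~ ., data = train, family = "binomial")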
summary(model)
Call:
glm(formula = left ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2479 -0.6680 -0.4071 -0.1202 2.9961
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2827357 0.2248177 -5.706 1.16e-08 ***
satisfaction_level -4.1286366 0.1172094 -35.224 < 2e-16 ***
last_evaluation 0.7392423 0.1790426 4.129 3.65e-05 ***
number_project -0.3291239 0.0255196 -12.897 < 2e-16 ***
average_montly_hours 0.0042719 0.0006165 6.929 4.24e-12 ***
time_spend_company 0.2824095 0.0185360 15.236 < 2e-16 ***
Work_accident -1.4922466 0.1050759 -14.202 < 2e-16 ***
promotion_last_5years -1.2400636 0.2777377 -4.465 8.01e-06 ***
saleshr 0.2034667 0.1553416 1.310 0.190263
salesIT -0.1535408 0.1440763 -1.066 0.286564
salesmanagement -0.5796677 0.1911888 -3.032 0.002430 **
salesmarketing 0.0519047 0.1566726 0.331 0.740422
salesproduct_mng -0.1894667 0.1543275 -1.228 0.219563
salesRandD -0.6380080 0.1746081 -3.654 0.000258 ***
salessales -0.0726861 0.1216017 -0.598 0.550014
salessupport 0.0920397 0.1290518 0.713 0.475722
salestechnical 0.0299766 0.1264105 0.237 0.812551
salarylow 1.7949986 0.1457106 12.319 < 2e-16 ***
salarymedium 1.2656258 0.1466135 8.632 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11542.6 on 10499 degrees of freedom
Residual deviance: 9067.3 on 10481 degrees of freedom
AIC: 9105.3
Number of Fisher Scoring iterations: 5
Running the glm over all predictors (left ~ .) produces the results above. The variables flagged with "*" are statistically significant and should be kept in the model.
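Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. For example, the salarylow estimate of about 1.79 translates to roughly a six-fold increase in the odds of leaving relative to the high-salary baseline:
# Convert log-odds coefficients to odds ratios
round(exp(coef(model)), 2)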
We will now refit the model, dropping the sales variable since most of its levels are not significant (the refit below also leaves out average_montly_hours).
model <- glm(left ~ satisfaction_level + last_evaluation + number_project + time_spend_company + Work_accident + promotion_last_5years + salary, data = train, family = "binomial")
summary(model)
Call:
glm(formula = left ~ satisfaction_level + last_evaluation + number_project +
time_spend_company + Work_accident + promotion_last_5years +
salary, family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0949 -0.6757 -0.4165 -0.1315 3.0610
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.02398 0.19106 -5.360 8.34e-08 ***
satisfaction_level -4.07982 0.11580 -35.232 < 2e-16 ***
last_evaluation 1.01937 0.17261 5.906 3.51e-09 ***
number_project -0.25707 0.02289 -11.233 < 2e-16 ***
time_spend_company 0.27703 0.01819 15.226 < 2e-16 ***
Work_accident -1.50926 0.10478 -14.405 < 2e-16 ***
promotion_last_5years -1.30746 0.27472 -4.759 1.94e-06 ***
salarylow 1.86831 0.14399 12.975 < 2e-16 ***
salarymedium 1.33818 0.14493 9.233 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11542.6 on 10499 degrees of freedom
Residual deviance: 9165.2 on 10491 degrees of freedom
AIC: 9183.2
Number of Fisher Scoring iterations: 5
Let's now predict on the test data set and measure the classification accuracy at a 0.5 cutoff.
probs <- predict(model, newdata = test, type = 'response')  # predicted probabilities
predictions <- ifelse(probs > 0.5, 1, 0)                    # classify at a 0.5 cutoff
misClassError <- mean(predictions != test$left)
print(paste('Accuracy', 1 - misClassError))
[1] "Accuracy 0.800400088908646"
Finally we want a threshold-independent measure of the model's performance. For this we will use the AUC: the area under the ROC curve, which plots the true positive rate (sensitivity) against the false positive rate (1 - specificity). Our model achieves an AUC of roughly 0.73.
pROC::auc(test$left, predictions)
Area under the curve: 0.7262
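Strictly speaking, the AUC is better computed on the raw predicted probabilities rather than the thresholded 0/1 classes, since thresholding throws away the ranking information the ROC curve is built from. Because we kept the probabilities in probs above, a sketch of the full ROC curve looks like this (the resulting AUC will generally be somewhat higher than the figure above):
# ROC curve from the raw probabilities; pROC::roc expects (response, predictor)
roc_curve <- roc(test$left, probs)
plot(roc_curve, main = "ROC curve for the logistic regression model")
auc(roc_curve)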
In conclusion, this analysis was an introduction to applying logistic regression to make predictions. We covered some techniques for exploring a data set, such as descriptive statistics and plotting. This was only a brief preface, but enough to build a basic understanding; further reading is recommended to grasp the overall modelling process.