Variable

Employee ID: The unique ID allocated to each employee (example: fffe390032003000)
Date of Joining: The date when the employee joined the organization (example: 2008-12-30)
Gender: The gender of the employee (Male/Female)
Company Type: The type of company where the employee works (Service/Product)
WFH Setup Available: Whether a work-from-home facility is available for the employee (Yes/No)
Designation: The employee's designation within the organization, in the range [0.0, 5.0]; higher values indicate a higher designation.
Resource Allocation: The amount of resources allocated to the employee, i.e. the number of working hours, in the range [1.0, 10.0] (higher means more resources).
Mental Fatigue Score: The level of mental fatigue the employee is facing, in the range [0.0, 10.0], where 0.0 means no fatigue and 10.0 means complete fatigue.
Burn Rate: The value we need to predict for each employee, telling the rate of burnout while working, in the range [0.0, 1.0], where higher values mean more burnout.

library(car)
library(caret) 
library(lmtest)
library(lubridate)
library(MASS)
library(robustbase)
library(tidyverse)
library(tidyr)

Load Data

bo<-read.csv("data/train.csv", stringsAsFactors = T)

str(bo)
## 'data.frame':    22750 obs. of  9 variables:
##  $ Employee.ID         : Factor w/ 22750 levels "fffe3100","fffe31003000",..: 7723 21063 2382 10791 6811 18173 17036 11870 8730 4492 ...
##  $ Date.of.Joining     : Factor w/ 366 levels "2008-01-01","2008-01-02",..: 274 335 70 308 206 331 2 305 362 69 ...
##  $ Gender              : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 1 1 1 ...
##  $ Company.Type        : Factor w/ 2 levels "Product","Service": 2 2 1 2 2 1 2 2 2 1 ...
##  $ WFH.Setup.Available : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 1 2 1 1 ...
##  $ Designation         : num  2 1 2 1 3 2 3 2 3 3 ...
##  $ Resource.Allocation : num  3 2 NA 1 7 4 6 4 6 6 ...
##  $ Mental.Fatigue.Score: num  3.8 5 5.8 2.6 6.9 3.6 7.9 4.4 NA NA ...
##  $ Burn.Rate           : num  0.16 0.36 0.49 0.2 0.52 0.29 0.62 0.33 0.56 0.67 ...

Data Pre-processing

Extract Dates

# Derive tenure from the joining date, using 2009-01-01 as the reference date
bo <- bo%>%
  mutate(date=as.Date("2009-01-01"),
         md=interval(ymd(Date.of.Joining),ymd(date)),
         tenure= time_length(md, "days")) # tenure in days of working, as if the current date is 2009-01-01

# Check tenure
head(bo[,c("Date.of.Joining","tenure")])
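As a quick sanity check (tenure_alt is just an illustrative name), the same tenure can be reproduced with base-R date arithmetic, assuming the same 2009-01-01 reference date as above:

# Day differences via base-R date arithmetic should match the lubridate-based tenure
tenure_alt <- as.numeric(as.Date("2009-01-01") - as.Date(bo$Date.of.Joining))
all.equal(tenure_alt, bo$tenure)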

Drop Unnecessary Variables

bo<-bo%>%dplyr::select(-c(Employee.ID,Date.of.Joining,date,md))%>%head(500) # drop helper columns and keep only the first 500 rows
head(bo)

One-Hot Encoding for Dummies

All of the categorical variables should be treated as dummy variables. We can use the dummyVars() function from the caret package to create them.

dum <- dummyVars(" ~ .", data=bo)
bo_dummy <- data.frame(predict(dum, newdata = bo)) 

head(bo_dummy)

As we can see, dummyVars creates one column for every level of each categorical variable, but we can't use them all: to avoid the dummy variable trap, we keep only one column per factor and drop the reference-level columns.

bo_dummy[,c('Gender.Female','Company.Type.Service','WFH.Setup.Available.No')]<-NULL
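As an alternative sketch (dum_full is an illustrative name), dummyVars can drop the reference levels automatically with fullRank = TRUE, so the three columns above would not need to be removed by hand:

# fullRank = TRUE keeps one dummy per factor and drops the reference-level columns
dum_full <- dummyVars(" ~ .", data = bo, fullRank = TRUE)
head(data.frame(predict(dum_full, newdata = bo)))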

Handling NAs

NAs on Target Variable

First of all, I would like to check whether the target variable has any NAs.

colSums(is.na(bo_dummy))
##             Gender.Male    Company.Type.Product WFH.Setup.Available.Yes 
##                       0                       0                       0 
##             Designation     Resource.Allocation    Mental.Fatigue.Score 
##                       0                      24                      49 
##               Burn.Rate                  tenure 
##                      25                       0

Apparently, the target variable (Burn.Rate) has 25 rows with NAs. We can get rid of them by dropping every row with an NA in Burn.Rate, since the NA proportion on the target is only about 5% of the total rows.

bo_clean<-bo_dummy%>%
  drop_na(Burn.Rate)

colSums(is.na(bo_clean))
##             Gender.Male    Company.Type.Product WFH.Setup.Available.Yes 
##                       0                       0                       0 
##             Designation     Resource.Allocation    Mental.Fatigue.Score 
##                       0                      23                      41 
##               Burn.Rate                  tenure 
##                       0                       0

NAs on Predictor Variables

Handling missing values in the predictors is a little trickier, because we don't want to throw away information that could be preserved with a more careful treatment.

Let's see the pattern of those missing values by using missing_plot() from the finalfit package.

bo_clean%>%finalfit::missing_plot()

We can see that the NAs in Resource.Allocation and Mental.Fatigue.Score look fairly random. Based on this, I'll drop the rows that have NAs in both variables and impute the rest with the mean.

bo_clean<-bo_clean%>%
  filter(!is.na(Resource.Allocation) | !is.na(Mental.Fatigue.Score))%>%
  mutate_all(~ifelse(is.na(.), mean(., na.rm = TRUE), .))

colSums(is.na(bo_clean))
##             Gender.Male    Company.Type.Product WFH.Setup.Available.Yes 
##                       0                       0                       0 
##             Designation     Resource.Allocation    Mental.Fatigue.Score 
##                       0                       0                       0 
##               Burn.Rate                  tenure 
##                       0                       0
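Had the median been preferred over the mean for imputation, the same pipeline would only need a different summary function; a sketch that does not affect the data used downstream (impute_median and bo_clean_med are illustrative names):

# Median-imputation variant of the cleaning pipeline (illustrative only)
impute_median <- function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)
bo_clean_med <- bo_dummy %>%
  drop_na(Burn.Rate) %>%
  filter(!is.na(Resource.Allocation) | !is.na(Mental.Fatigue.Score)) %>%
  mutate_all(impute_median)
colSums(is.na(bo_clean_med))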

Scaling Data

The variables are on very different scales (see the variables section above), which can hurt our predictions, so let's normalize the predictors (all columns except Burn.Rate) with the preProcess() function from the caret package.

proc <- preProcess(bo_clean[,c(1:6,8)], method=c("center", "scale"))
 
normz <- predict(proc, bo_clean[,c(1:6,8)])

summary(normz)
##   Gender.Male      Company.Type.Product WFH.Setup.Available.Yes
##  Min.   :-0.9003   Min.   :-0.7244      Min.   :-1.0622        
##  1st Qu.:-0.9003   1st Qu.:-0.7244      1st Qu.:-1.0622        
##  Median :-0.9003   Median :-0.7244      Median : 0.9395        
##  Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.0000        
##  3rd Qu.: 1.1084   3rd Qu.: 1.3776      3rd Qu.: 0.9395        
##  Max.   : 1.1084   Max.   : 1.3776      Max.   : 0.9395        
##   Designation      Resource.Allocation Mental.Fatigue.Score     tenure        
##  Min.   :-1.8641   Min.   :-1.7191     Min.   :-3.2796      Min.   :-1.66727  
##  1st Qu.:-0.9911   1st Qu.:-0.7349     1st Qu.:-0.5768      1st Qu.:-0.85719  
##  Median :-0.1181   Median : 0.0000     Median : 0.0000      Median :-0.04711  
##  Mean   : 0.0000   Mean   : 0.0000     Mean   : 0.0000      Mean   : 0.00000  
##  3rd Qu.: 0.7549   3rd Qu.: 0.7414     3rd Qu.: 0.6056      3rd Qu.: 0.92127  
##  Max.   : 2.5008   Max.   : 2.7099     Max.   : 2.3512      Max.   : 1.72204

Then combine the target variable and the normalized predictors into one data frame.

burnout<-cbind(bo_clean[c("Burn.Rate")],normz)

Summary of Cleaned Data

summary(burnout)
##    Burn.Rate       Gender.Male      Company.Type.Product
##  Min.   :0.0000   Min.   :-0.9003   Min.   :-0.7244     
##  1st Qu.:0.3400   1st Qu.:-0.9003   1st Qu.:-0.7244     
##  Median :0.4600   Median :-0.9003   Median :-0.7244     
##  Mean   :0.4661   Mean   : 0.0000   Mean   : 0.0000     
##  3rd Qu.:0.6100   3rd Qu.: 1.1084   3rd Qu.: 1.3776     
##  Max.   :1.0000   Max.   : 1.1084   Max.   : 1.3776     
##  WFH.Setup.Available.Yes  Designation      Resource.Allocation
##  Min.   :-1.0622         Min.   :-1.8641   Min.   :-1.7191    
##  1st Qu.:-1.0622         1st Qu.:-0.9911   1st Qu.:-0.7349    
##  Median : 0.9395         Median :-0.1181   Median : 0.0000    
##  Mean   : 0.0000         Mean   : 0.0000   Mean   : 0.0000    
##  3rd Qu.: 0.9395         3rd Qu.: 0.7549   3rd Qu.: 0.7414    
##  Max.   : 0.9395         Max.   : 2.5008   Max.   : 2.7099    
##  Mental.Fatigue.Score     tenure        
##  Min.   :-3.2796      Min.   :-1.66727  
##  1st Qu.:-0.5768      1st Qu.:-0.85719  
##  Median : 0.0000      Median :-0.04711  
##  Mean   : 0.0000      Mean   : 0.00000  
##  3rd Qu.: 0.6056      3rd Qu.: 0.92127  
##  Max.   : 2.3512      Max.   : 1.72204

Multiple Linear Regression

Split Data

To evaluate the linear regression models, I'll split the burnout data frame into train and test sets with an 80:20 ratio.

set.seed(156)
index <- sample((nrow(burnout)), size = round(0.8 * nrow(burnout), 0))

train <- burnout[index, ]
test <- burnout[-index, ]
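As an alternative sketch (train_cdp and test_cdp are illustrative names), caret's createDataPartition produces a split that keeps the Burn.Rate distribution similar across the two sets; the simple random split above is what is used in the rest of this analysis:

# Stratified-style 80:20 split on the outcome with caret
set.seed(156)
idx <- createDataPartition(burnout$Burn.Rate, p = 0.8, list = FALSE)
train_cdp <- burnout[idx, ]
test_cdp  <- burnout[-idx, ]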

Regressions

After splitting the data, let's fit the models and compare their RMSE values before choosing one for the next step (assumption checking).

Original Model

The original model uses multiple linear regression to predict Burn.Rate from all of the predictor variables.

The equation for this model:
\[Burn.Rate_i=\beta_0+\delta_1 Gender.Male_i+\delta_2 Company.Type.Product_i+\delta_3 WFH.Setup.Available.Yes_i+\\ \beta_1 Designation_i+\beta_2 Resource.Allocation_i+\beta_3 Mental.Fatigue.Score_i+\beta_4 tenure_i\]
reg<-lm(formula=Burn.Rate~. , data=train)
summary(reg)
## 
## Call:
## lm(formula = Burn.Rate ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28657 -0.04315 -0.00080  0.04105  0.31843 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              0.4672863  0.0036979 126.365 < 0.0000000000000002 ***
## Gender.Male              0.0091581  0.0038259   2.394             0.017176 *  
## Company.Type.Product    -0.0000139  0.0036989  -0.004             0.997005    
## WFH.Setup.Available.Yes -0.0150929  0.0039992  -3.774             0.000187 ***
## Designation              0.0139612  0.0072954   1.914             0.056428 .  
## Resource.Allocation      0.0620790  0.0081987   7.572    0.000000000000297 ***
## Mental.Fatigue.Score     0.1115481  0.0057751  19.315 < 0.0000000000000002 ***
## tenure                  -0.0004988  0.0036497  -0.137             0.891358    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07169 on 370 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.8538 
## F-statistic: 315.5 on 7 and 370 DF,  p-value: < 0.00000000000000022

From the results, tenure and Company.Type.Product appear to be insignificant variables for the model, as their p-values are above 0.1.

Stepwise Model

reg_bw<-step(reg, direction = "backward", trace=0)

summary(reg_bw)
## 
## Call:
## lm(formula = Burn.Rate ~ Gender.Male + WFH.Setup.Available.Yes + 
##     Designation + Resource.Allocation + Mental.Fatigue.Score, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28650 -0.04274 -0.00059  0.04062  0.31898 
## 
## Coefficients:
##                          Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              0.467276   0.003687 126.729 < 0.0000000000000002 ***
## Gender.Male              0.009127   0.003808   2.397              0.01703 *  
## WFH.Setup.Available.Yes -0.015093   0.003988  -3.784              0.00018 ***
## Designation              0.014041   0.007251   1.936              0.05358 .  
## Resource.Allocation      0.062002   0.008156   7.602     0.00000000000024 ***
## Mental.Fatigue.Score     0.111551   0.005735  19.452 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0715 on 372 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.8546 
## F-statistic:   444 on 5 and 372 DF,  p-value: < 0.00000000000000022
The equation for this model:
\[Burn.Rate_i=\beta_0+\delta_1 Gender.Male_i+\delta_2 WFH.Setup.Available.Yes_i+\beta_1 Designation_i+\\ \beta_2 Resource.Allocation_i+\beta_3 Mental.Fatigue.Score_i\]

We can see that the F-statistic increases in this stepwise model, which also drops the two variables that were insignificant in the original model.

Checking the Error Metrics

I'll use RMSE as the basis for model comparison, together with AIC.
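For reference, RMSE is the root of the average squared difference between predictions and observed values:
\[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^2}\]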

predik <- predict(reg, newdata = test %>% dplyr::select(-Burn.Rate))
predik2 <- predict(reg_bw, newdata = test %>% dplyr::select(-Burn.Rate))

# RMSE of train dataset
m1 <- RMSE(pred = reg$fitted.values, obs = train$Burn.Rate)
m2 <- RMSE(pred = reg_bw$fitted.values, obs = train$Burn.Rate)

# RMSE of test dataset
t1 <- RMSE(pred = predik, obs = test$Burn.Rate)
t2 <- RMSE(pred = predik2, obs = test$Burn.Rate)

a1<-AIC(reg)
a2<-AIC(reg_bw)

error<-c("pred","pred_bw")
train_value<-c(m1,m2)
test_value<-c(t1,t2)
aic<-c(a1,a2)

error1<-data.frame(error,train_value,test_value,aic)
error1

The results are somewhat mixed: the best model on the training data is the stepwise model, while on the test data the original model performs best. The AIC values, however, indicate that the stepwise model is the better one.

Checking the Assumptions

Checking the linear regression assumptions after fitting is a mandatory step; based on Gujarati (2004), the model should fulfill the following assumptions:
1. Linear Relationship
2. Normal Distribution
3. No Multicollinearity (a VIF check is sketched right after this list)
4. Homoskedasticity
5. Zero Mean Value
6. No Outlier
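The multicollinearity assumption can be checked with variance inflation factors from the car package, which is already loaded; a small check on the stepwise model:

# Variance inflation factors for the stepwise model; as a rule of thumb,
# values above ~10 would indicate problematic multicollinearity
vif(reg_bw)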

Diagnostic Plots

layout(matrix(c(1,2,3,4),2,2)) 
plot(reg_bw)

Outlier

Outliers can be one source of inconsistency in an OLS model.

cooksd<-cooks.distance(reg_bw)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")  # plot cook's distance
abline(h = 4*mean(cooksd, na.rm=T), col="red")  # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red")

influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])  # influential row numbers
burnout[influential, ]

Linearity

plot(reg_bw, 1)

Normality

resd <- studres(reg_bw)
hist(resd, freq=FALSE,
   main="Distribution of Studentized Residuals")
xfit<-seq(min(resd),max(resd),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)

shapiro.test(resd)
## 
##  Shapiro-Wilk normality test
## 
## data:  resd
## W = 0.96553, p-value = 0.00000008958

The p-value is far below 0.05, so the Shapiro-Wilk test rejects normality of the studentized residuals; this assumption is not fulfilled, which is one motivation for the model fixes below.

Independence

set.seed(156)
durbinWatsonTest(reg_bw)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.04222674      2.071021   0.536
##  Alternative hypothesis: rho != 0

Our stepwise model has a p-value > 0.05, which means it fulfills the independence assumption.

Homoskedasticity

bptest(reg_bw)
## 
##  studentized Breusch-Pagan test
## 
## data:  reg_bw
## BP = 8.3426, df = 5, p-value = 0.1383

With a p-value above 0.05, this model fulfills the homoskedasticity assumption.

spreadLevelPlot(reg_bw)

## 
## Suggested power transformation:  0.8880394
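spreadLevelPlot suggests a power transformation of roughly 0.89 for the response. As a sketch only (reg_pt is an illustrative name and this model is not carried forward), refitting the stepwise model with that transformation could look like:

# Sketch (not carried forward): refit with the suggested power transformation of the response
reg_pt <- lm(I(Burn.Rate^0.888) ~ Gender.Male + WFH.Setup.Available.Yes + Designation +
               Resource.Allocation + Mental.Fatigue.Score, data = train)
summary(reg_pt)$r.squared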

Fixing the Model

Robust Regression

reg_rob<-lmrob(Burn.Rate ~ ., data = train) # MM-type robust regression (robustbase::lmrob)

summary(reg_rob)
## 
## Call:
## lmrob(formula = Burn.Rate ~ ., data = train)
##  \--> method = "MM"
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.3029939 -0.0412375  0.0008909  0.0417735  0.3542767 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              0.4646889  0.0032966 140.961 < 0.0000000000000002 ***
## Gender.Male              0.0098914  0.0034308   2.883              0.00417 ** 
## Company.Type.Product     0.0006442  0.0034145   0.189              0.85046    
## WFH.Setup.Available.Yes -0.0112517  0.0035023  -3.213              0.00143 ** 
## Designation              0.0130227  0.0069973   1.861              0.06352 .  
## Resource.Allocation      0.0463739  0.0085911   5.398          0.000000121 ***
## Mental.Fatigue.Score     0.1247048  0.0056773  21.965 < 0.0000000000000002 ***
## tenure                   0.0014984  0.0032085   0.467              0.64078    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Robust residual standard error: 0.06311 
## Multiple R-squared:  0.8864, Adjusted R-squared:  0.8843 
## Convergence in 11 IRWLS iterations
## 
## Robustness weights: 
##  3 observations c(5,223,337) are outliers with |weight| = 0 ( < 0.00026); 
##  28 weights are ~= 1. The remaining 347 ones are summarized as
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0007028 0.8836000 0.9574000 0.9054000 0.9839000 0.9988000 
## Algorithmic parameters: 
##        tuning.chi                bb        tuning.psi        refine.tol 
## 1.547640000000000 0.500000000000000 4.685061000000000 0.000000100000000 
##           rel.tol         scale.tol         solve.tol       eps.outlier 
## 0.000000100000000 0.000000000100000 0.000000100000000 0.000264550264550 
##             eps.x warn.limit.reject warn.limit.meanrw 
## 0.000000000005966 0.500000000000000 0.500000000000000 
##      nResample         max.it       best.r.s       k.fast.s          k.max 
##            500             50              2              1            200 
##    maxit.scale      trace.lev            mts     compute.rd fast.s.large.n 
##            200              0           1000              0           2000 
##                   psi           subsampling                   cov 
##            "bisquare"         "nonsingular"         ".vcov.avar1" 
## compute.outlier.stats 
##                  "SM" 
## seed : int(0)

Weighted Least Square

# Weights: inverse of the squared fitted values from regressing |residuals| on the fitted values,
# so observations with larger estimated error variance get smaller weights
wls <- 1 / lm(abs(reg_bw$residuals) ~ reg_bw$fitted.values)$fitted.values^2
reg_wls <- lm(Burn.Rate ~ ., data = train, weights=wls)

summary(reg_wls)
## 
## Call:
## lm(formula = Burn.Rate ~ ., data = train, weights = wls)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4366 -0.7720 -0.0063  0.7870  5.7641 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              0.4672491  0.0036782 127.031 < 0.0000000000000002 ***
## Gender.Male              0.0093795  0.0037959   2.471             0.013925 *  
## Company.Type.Product     0.0000354  0.0036682   0.010             0.992305    
## WFH.Setup.Available.Yes -0.0153340  0.0039464  -3.886             0.000121 ***
## Designation              0.0140920  0.0072622   1.940             0.053083 .  
## Resource.Allocation      0.0612390  0.0081773   7.489    0.000000000000514 ***
## Mental.Fatigue.Score     0.1120892  0.0055973  20.026 < 0.0000000000000002 ***
## tenure                   0.0001600  0.0036201   0.044             0.964776    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.343 on 370 degrees of freedom
## Multiple R-squared:  0.8599, Adjusted R-squared:  0.8572 
## F-statistic: 324.3 on 7 and 370 DF,  p-value: < 0.00000000000000022

Comparison of All Models

predik3 <- predict(reg_rob, newdata = test %>% dplyr::select(-Burn.Rate))
predik4 <- predict(reg_wls, newdata = test %>% dplyr::select(-Burn.Rate) )

# RMSE of train dataset
m3 <- RMSE(pred = reg_rob$fitted.values, obs = train$Burn.Rate)
m4 <- RMSE(pred = reg_wls$fitted.values, obs = train$Burn.Rate)

# RMSE of test dataset
t3 <- RMSE(pred = predik3, obs = test$Burn.Rate)
t4 <- RMSE(pred = predik4, obs = test$Burn.Rate)

# AIC
a4<-AIC(reg_wls)

error1<-error1%>%add_row(error=c("pred_robust","pred_wls"),
                 train_value=c(m3,m4),
                 test_value=c(t3,t4),
                 aic=c(NA,a4))
error1

Based on this comparison, the best model to consider is the weighted least squares regression.

These are the diagnostic plots for our final model:

layout(matrix(c(1,2,3,4),2,2)) 
plot(reg_wls)

Conclusion

Data with many categorical variables can make a plain linear model behave poorly. Consider testing other models beyond linear regression as well.
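Following up on that suggestion, a non-linear model could be benchmarked against the linear ones on the same split; a minimal sketch with caret (rf_fit is an illustrative name, and method = "rf" assumes the randomForest package is installed):

# Sketch: a random forest as a non-linear benchmark, evaluated with the same test RMSE
set.seed(156)
rf_fit <- train(Burn.Rate ~ ., data = train, method = "rf",
                trControl = trainControl(method = "cv", number = 5))
RMSE(pred = predict(rf_fit, newdata = test), obs = test$Burn.Rate)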

Reference

  1. Gujarati, D. N. (2004). Basic econometrics. New Delhi: Tata McGraw-Hill.