Employee ID: The unique ID allocated for each employee (example: fffe390032003000)
Date of Joining: The date-time when the employee has joined the organization (example: 2008-12-30)
Gender: The gender of the employee (Male/Female)
Company Type: The type of company where the employee is working (Service/Product)
WFH Setup Available: Is the work from home facility available for the employee (Yes/No)
Designation: The designation of the employee of work in the organization. In the range of [0.0, 5.0] bigger is higher designation.
Resource Allocation: The amount of resource allocated to the employee to work, ie. number of working hours.In the range of [1.0, 10.0] (higher means more resource)
Mental Fatigue Score: The level of fatigue mentally the employee is facing. In the range of [0.0, 10.0] where 0.0 means no fatigue and 10.0 means completely fatigue.
Burn Rate: The value we need to predict for each employee telling the rate of Bur out while working. In the range of [0.0, 1.0] where the higher the value is more is the burn out.
library(car)
library(caret)
library(lmtest)
library(lubridate)
library(MASS)
library(robustbase)
library(tidyverse)
library(tidyr)
bo<-read.csv("data/train.csv", stringsAsFactors = T)
str(bo)
## 'data.frame': 22750 obs. of 9 variables:
## $ Employee.ID : Factor w/ 22750 levels "fffe3100","fffe31003000",..: 7723 21063 2382 10791 6811 18173 17036 11870 8730 4492 ...
## $ Date.of.Joining : Factor w/ 366 levels "2008-01-01","2008-01-02",..: 274 335 70 308 206 331 2 305 362 69 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 1 1 1 ...
## $ Company.Type : Factor w/ 2 levels "Product","Service": 2 2 1 2 2 1 2 2 2 1 ...
## $ WFH.Setup.Available : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 1 2 1 1 ...
## $ Designation : num 2 1 2 1 3 2 3 2 3 3 ...
## $ Resource.Allocation : num 3 2 NA 1 7 4 6 4 6 6 ...
## $ Mental.Fatigue.Score: num 3.8 5 5.8 2.6 6.9 3.6 7.9 4.4 NA NA ...
## $ Burn.Rate : num 0.16 0.36 0.49 0.2 0.52 0.29 0.62 0.33 0.56 0.67 ...
# Changing type on variables needed
bo <- bo%>%
mutate(date=as.Date("2009-01-01"),
md=interval(ymd(Date.of.Joining),ymd(date)),
tenure= time_length(md, "days")) # making the tenure time based on days of working (as if it is 2019)
# Check tenure
head(bo[,c("Date.of.Joining","tenure")])
bo<-bo%>%dplyr::select(-c(Employee.ID,Date.of.Joining,date,md))%>%head(500)
head(bo)
All of the categorical variables should be treated as dummy variables. We can use dummyVars function from caret package to make those dummies.
dum <- dummyVars(" ~ .", data=bo)
bo_dummy <- data.frame(predict(dum, newdata = bo))
head(bo_dummy)
As we can see, there would many variables based on categorical created by dummyVars, but we can’t use them all. All we want to do now is just to take one variable that would be used for the model.
bo_dummy[,c('Gender.Female','Company.Type.Service','WFH.Setup.Available.No')]<-NULL
First of all, I would like to check whether the target variable has NAs on it
colSums(is.na(bo_dummy))
## Gender.Male Company.Type.Product WFH.Setup.Available.Yes
## 0 0 0
## Designation Resource.Allocation Mental.Fatigue.Score
## 0 24 49
## Burn.Rate tenure
## 25 0
Apparently, on the target variable (Burn.Rate), there are 1124 rows that detected to have the NAs. We can get rid of the NAs by omit all of the rows that has NAs on the Burn.Rate variables as the NA proportions on target is just 4.94% to the total rows
bo_clean<-bo_dummy%>%
drop_na(Burn.Rate)
colSums(is.na(bo_clean))
## Gender.Male Company.Type.Product WFH.Setup.Available.Yes
## 0 0 0
## Designation Resource.Allocation Mental.Fatigue.Score
## 0 23 41
## Burn.Rate tenure
## 0 0
Handling missing values on predictor is a little bit tricky, because we don’t want to throw out some precious information that we could lose if we don’t treat it very well.
let’s see the pattern from those missing datas by using missing_plots on finalfit package
bo_clean%>%finalfit::missing_plot()
Then we can see that mainly, the NAs on Resource.Allocation and Mental.Fatigue.Score kind of random. From these datas, I’ll decide to rip off the rows that has 2 NAs on both variables and impute the rest with median.
bo_clean<-bo_clean%>%
filter(!is.na(Resource.Allocation) | !is.na(Mental.Fatigue.Score))%>%
mutate_all(~ifelse(is.na(.), mean(., na.rm = TRUE), .))
colSums(is.na(bo_clean))
## Gender.Male Company.Type.Product WFH.Setup.Available.Yes
## 0 0 0
## Designation Resource.Allocation Mental.Fatigue.Score
## 0 0 0
## Burn.Rate tenure
## 0 0
This data has multiple interpretation on many variables (see the variables section above) and it can ruin our prediction, so let’s normalize the variables needed by using preProcessing function from caret library.
proc <- preProcess(bo_clean[,c(1:6,8)], method=c("center", "scale"))
normz <- predict(proc, bo_clean[,c(1:6,8)])
summary(normz)
## Gender.Male Company.Type.Product WFH.Setup.Available.Yes
## Min. :-0.9003 Min. :-0.7244 Min. :-1.0622
## 1st Qu.:-0.9003 1st Qu.:-0.7244 1st Qu.:-1.0622
## Median :-0.9003 Median :-0.7244 Median : 0.9395
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.1084 3rd Qu.: 1.3776 3rd Qu.: 0.9395
## Max. : 1.1084 Max. : 1.3776 Max. : 0.9395
## Designation Resource.Allocation Mental.Fatigue.Score tenure
## Min. :-1.8641 Min. :-1.7191 Min. :-3.2796 Min. :-1.66727
## 1st Qu.:-0.9911 1st Qu.:-0.7349 1st Qu.:-0.5768 1st Qu.:-0.85719
## Median :-0.1181 Median : 0.0000 Median : 0.0000 Median :-0.04711
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.7549 3rd Qu.: 0.7414 3rd Qu.: 0.6056 3rd Qu.: 0.92127
## Max. : 2.5008 Max. : 2.7099 Max. : 2.3512 Max. : 1.72204
then make the dummy variables and normalize variables as 1 data frame
burnout<-cbind(bo_clean[c("Burn.Rate")],normz)
summary(burnout)
## Burn.Rate Gender.Male Company.Type.Product
## Min. :0.0000 Min. :-0.9003 Min. :-0.7244
## 1st Qu.:0.3400 1st Qu.:-0.9003 1st Qu.:-0.7244
## Median :0.4600 Median :-0.9003 Median :-0.7244
## Mean :0.4661 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:0.6100 3rd Qu.: 1.1084 3rd Qu.: 1.3776
## Max. :1.0000 Max. : 1.1084 Max. : 1.3776
## WFH.Setup.Available.Yes Designation Resource.Allocation
## Min. :-1.0622 Min. :-1.8641 Min. :-1.7191
## 1st Qu.:-1.0622 1st Qu.:-0.9911 1st Qu.:-0.7349
## Median : 0.9395 Median :-0.1181 Median : 0.0000
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.9395 3rd Qu.: 0.7549 3rd Qu.: 0.7414
## Max. : 0.9395 Max. : 2.5008 Max. : 2.7099
## Mental.Fatigue.Score tenure
## Min. :-3.2796 Min. :-1.66727
## 1st Qu.:-0.5768 1st Qu.:-0.85719
## Median : 0.0000 Median :-0.04711
## Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6056 3rd Qu.: 0.92127
## Max. : 2.3512 Max. : 1.72204
For making the prediction on linear regression, I’ll split burnout dataframe into train and test sampling by having 80:20 ratio on the splitting process.
set.seed(156)
index <- sample((nrow(burnout)), size = round(0.8 * nrow(burnout), 0))
train <- burnout[index, ]
test <- burnout[-index, ]
After the data splitting, let’s make model and compare the RMSE value to be chosen on the next step (assumption checking).
The original model use the “Multiple Linear Regression” for checking the Burn.Rate variable prediction affected by all of the predictor variables.
reg<-lm(formula=Burn.Rate~. , data=train)
summary(reg)
##
## Call:
## lm(formula = Burn.Rate ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28657 -0.04315 -0.00080 0.04105 0.31843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4672863 0.0036979 126.365 < 0.0000000000000002 ***
## Gender.Male 0.0091581 0.0038259 2.394 0.017176 *
## Company.Type.Product -0.0000139 0.0036989 -0.004 0.997005
## WFH.Setup.Available.Yes -0.0150929 0.0039992 -3.774 0.000187 ***
## Designation 0.0139612 0.0072954 1.914 0.056428 .
## Resource.Allocation 0.0620790 0.0081987 7.572 0.000000000000297 ***
## Mental.Fatigue.Score 0.1115481 0.0057751 19.315 < 0.0000000000000002 ***
## tenure -0.0004988 0.0036497 -0.137 0.891358
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07169 on 370 degrees of freedom
## Multiple R-squared: 0.8565, Adjusted R-squared: 0.8538
## F-statistic: 315.5 on 7 and 370 DF, p-value: < 0.00000000000000022
As from the results, we found that tenure and Company.Type.Product seems to be a insignificant variables for the model, as their p-value above 0.1.
reg_bw<-step(reg, direction = "backward", trace=0)
summary(reg_bw)
##
## Call:
## lm(formula = Burn.Rate ~ Gender.Male + WFH.Setup.Available.Yes +
## Designation + Resource.Allocation + Mental.Fatigue.Score,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28650 -0.04274 -0.00059 0.04062 0.31898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.467276 0.003687 126.729 < 0.0000000000000002 ***
## Gender.Male 0.009127 0.003808 2.397 0.01703 *
## WFH.Setup.Available.Yes -0.015093 0.003988 -3.784 0.00018 ***
## Designation 0.014041 0.007251 1.936 0.05358 .
## Resource.Allocation 0.062002 0.008156 7.602 0.00000000000024 ***
## Mental.Fatigue.Score 0.111551 0.005735 19.452 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0715 on 372 degrees of freedom
## Multiple R-squared: 0.8565, Adjusted R-squared: 0.8546
## F-statistic: 444 on 5 and 372 DF, p-value: < 0.00000000000000022
the equations for this model:
We can see the increasing of F-Statistics on this stepwise model, and reducing 2 variables that insignificant on the former model.
I’ll using RMSE as the base on model comparisons
predik <- predict(reg, newdata = test %>% dplyr::select(-Burn.Rate))
predik2 <- predict(reg_bw, newdata = test %>% dplyr::select(-Burn.Rate))
# RMSE of train dataset
m1 <- RMSE(pred = reg$fitted.values, obs = train$Burn.Rate)
m2 <- RMSE(pred = reg_bw$fitted.values, obs = train$Burn.Rate)
# RMSE of test dataset
t1 <- RMSE(pred = predik, obs = test$Burn.Rate)
t2 <- RMSE(pred = predik2, obs = test$Burn.Rate)
a1<-AIC(reg)
a2<-AIC(reg_bw)
error<-c("pred","pred_bw")
train_value<-c(m1,m2)
test_value<-c(t1,t2)
aic<-c(a1,a2)
error1<-data.frame(error,train_value,test_value,aic)
error1
There are the things that missed, as the best model for training data is the stepwise model and for the test data is Original Model. and for the AIC values states that stepwise model is the best model.
Checking the Linear Assumption after regression is a mandatory thing to do, as we should fullfill all of the 5 assumptions on Multiple Linear Regression based on Gujarati(2003):
1. Linear Relationship
2. Normal Distribution
3. No Multicollinearity
4. Homoskedasticity
5. Zero Mean Value
6. No Outlier
layout(matrix(c(1,2,3,4),2,2))
plot(reg_bw)
Outlier could be the one of the inconsistency failure on OLS model.
cooksd<-cooks.distance(reg_bw)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") # plot cook's distance
abline(h = 4*mean(cooksd, na.rm=T), col="red") # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red")
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
burnout[influential, ]
plot(reg_bw, 1)
## Normality
resd <- studres(reg_bw)
hist(resd, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(resd),max(resd),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
shapiro.test(resd)
##
## Shapiro-Wilk normality test
##
## data: resd
## W = 0.96553, p-value = 0.00000008958
set.seed(156)
durbinWatsonTest(reg_bw)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.04222674 2.071021 0.536
## Alternative hypothesis: rho != 0
Our stepwise model got the p-value>0.05, means that this model is fullfill the Independency assumption
bptest(reg_bw)
##
## studentized Breusch-Pagan test
##
## data: reg_bw
## BP = 8.3426, df = 5, p-value = 0.1383
this model fullfill homoskedasticity assumption
spreadLevelPlot(reg_bw)
##
## Suggested power transformation: 0.8880394
reg_rob<-lmrob(Burn.Rate ~ ., data = train)
summary(reg_rob)
##
## Call:
## lmrob(formula = Burn.Rate ~ ., data = train)
## \--> method = "MM"
## Residuals:
## Min 1Q Median 3Q Max
## -0.3029939 -0.0412375 0.0008909 0.0417735 0.3542767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4646889 0.0032966 140.961 < 0.0000000000000002 ***
## Gender.Male 0.0098914 0.0034308 2.883 0.00417 **
## Company.Type.Product 0.0006442 0.0034145 0.189 0.85046
## WFH.Setup.Available.Yes -0.0112517 0.0035023 -3.213 0.00143 **
## Designation 0.0130227 0.0069973 1.861 0.06352 .
## Resource.Allocation 0.0463739 0.0085911 5.398 0.000000121 ***
## Mental.Fatigue.Score 0.1247048 0.0056773 21.965 < 0.0000000000000002 ***
## tenure 0.0014984 0.0032085 0.467 0.64078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 0.06311
## Multiple R-squared: 0.8864, Adjusted R-squared: 0.8843
## Convergence in 11 IRWLS iterations
##
## Robustness weights:
## 3 observations c(5,223,337) are outliers with |weight| = 0 ( < 0.00026);
## 28 weights are ~= 1. The remaining 347 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007028 0.8836000 0.9574000 0.9054000 0.9839000 0.9988000
## Algorithmic parameters:
## tuning.chi bb tuning.psi refine.tol
## 1.547640000000000 0.500000000000000 4.685061000000000 0.000000100000000
## rel.tol scale.tol solve.tol eps.outlier
## 0.000000100000000 0.000000000100000 0.000000100000000 0.000264550264550
## eps.x warn.limit.reject warn.limit.meanrw
## 0.000000000005966 0.500000000000000 0.500000000000000
## nResample max.it best.r.s k.fast.s k.max
## 500 50 2 1 200
## maxit.scale trace.lev mts compute.rd fast.s.large.n
## 200 0 1000 0 2000
## psi subsampling cov
## "bisquare" "nonsingular" ".vcov.avar1"
## compute.outlier.stats
## "SM"
## seed : int(0)
wls <- 1 / lm(abs(reg_bw$residuals) ~ reg_bw$fitted.values)$fitted.values^2
reg_wls <- lm(Burn.Rate ~ ., data = train, weights=wls)
summary(reg_wls)
##
## Call:
## lm(formula = Burn.Rate ~ ., data = train, weights = wls)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.4366 -0.7720 -0.0063 0.7870 5.7641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4672491 0.0036782 127.031 < 0.0000000000000002 ***
## Gender.Male 0.0093795 0.0037959 2.471 0.013925 *
## Company.Type.Product 0.0000354 0.0036682 0.010 0.992305
## WFH.Setup.Available.Yes -0.0153340 0.0039464 -3.886 0.000121 ***
## Designation 0.0140920 0.0072622 1.940 0.053083 .
## Resource.Allocation 0.0612390 0.0081773 7.489 0.000000000000514 ***
## Mental.Fatigue.Score 0.1120892 0.0055973 20.026 < 0.0000000000000002 ***
## tenure 0.0001600 0.0036201 0.044 0.964776
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.343 on 370 degrees of freedom
## Multiple R-squared: 0.8599, Adjusted R-squared: 0.8572
## F-statistic: 324.3 on 7 and 370 DF, p-value: < 0.00000000000000022
predik3 <- predict(reg_rob, newdata = test %>% dplyr::select(-Burn.Rate))
predik4 <- predict(reg_wls, newdata = test %>% dplyr::select(-Burn.Rate) )
# RMSE of train dataset
m3 <- RMSE(pred = reg_rob$fitted.values, obs = train$Burn.Rate)
m4 <- RMSE(pred = reg_wls$fitted.values, obs = train$Burn.Rate)
# RMSE of test dataset
t3 <- RMSE(pred = predik3, obs = test$Burn.Rate)
t4 <- RMSE(pred = predik4, obs = test$Burn.Rate)
# AIC
a4<-AIC(reg_wls)
error1<-error1%>%add_row(error=c("pred_robust","pred_wls"),
train_value=c(m3,m4),
test_value=c(t3,t4),
aic=c(NA,a4))
error1
best model we could consider to use is weighted least square regression
this is our best last plot
layout(matrix(c(1,2,3,4),2,2))
plot(reg_wls)
Data that consisted many categorical variable could lead to the unproportional linear model. Consider to test another model outside the linear regression.