Introduction

Brief

Employes Productivity

For Garment Industry, employes productivity will influence their income. That’s why HR Departement want to optimize the productivity. For profesional employes, internal factor will not interupt their job. So as an analyst we want to help our HR Departement to know what factor can influence their productivity and make some Machine Learning model to optimize employes productivity.

About Data

This dataset from Kaggle (https://www.kaggle.com/ishadss/productivity-prediction-of-garment-employees). In this data set we will find 15 column :

date : Date in MM-DD-YYYY

day : Day of the Week

quarter : A portion of the month. A month was divided into four quarters

department : Associated department with the instance

teamno : Associated team number with the instance

noofworkers : Number of workers in each team

noofstylechange : Number of changes in the style of a particular product

targetedproductivity : Targeted productivity set by the Authority for each team for each day.

smv : Standard Minute Value, it is the allocated time for a task

wip : Work in progress. Includes the number of unfinished items for products 11 overtime : Represents the amount of overtime by each team in minutes

incentive : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.

idletime : The amount of time when the production was interrupted due to several reasons

idlemen : The number of workers who were idle due to production interruption

actual_productivity : The actual % of productivity that was delivered by the workers. It ranges from 0-1.

Preparation

First start with prepare our tools to make ML

library(tidyverse)
library(ggplot2)
library(dplyr)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(gsubfn)

Read the data

df <- read.csv("garments_worker_productivity.csv")
glimpse(df)
## Rows: 1,197
## Columns: 15
## $ date                  <chr> "1/1/2015", "1/1/2015", "1/1/2015", "1/1/2015", ~
## $ quarter               <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", ~
## $ department            <chr> "sweing", "finishing ", "sweing", "sweing", "swe~
## $ day                   <chr> "Thursday", "Thursday", "Thursday", "Thursday", ~
## $ team                  <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,~
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, ~
## $ smv                   <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2~
## $ wip                   <int> 1108, NA, 968, 968, 1170, 984, NA, 795, 733, 681~
## $ over_time             <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60~
## $ incentive             <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50~
## $ idle_time             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ idle_men              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_style_change    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_workers         <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55~
## $ actual_productivity   <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80~

Date as quarter group : quarter 1 ~ 1-7 quarter 2 ~ 8-14 quarter 3 ~ 15-21 quarter 4 ~ 22-28 quarter 5 ~ 29-31

EDA

Before we clean the data, we will try find out some insight.

we have incentive column, how much we spent the money for employees incentive each quarter?

df_iq <- aggregate(incentive ~ quarter, df, FUN = sum)
df_iq <- df_iq[order(df_iq$incentive, decreasing = T),]
ggplot(df_iq, aes(x= incentive, y = reorder(quarter, incentive))) +
  geom_col() +
  labs(x= "Incentive",
       y = "Quarter")

In quarter 2, we spent so much money than another quarter. At the same time, that we have actual productivity in that periods?

df_aq <- aggregate(actual_productivity ~ quarter, df, FUN = sum)
df_aq <- df_aq[order(df_aq$actual_productivity, decreasing = T),]
ggplot(df_aq, aes(x= actual_productivity, y = reorder(quarter, actual_productivity))) +
  geom_col() +
  labs(x= "Incentive",
       y = "Quarter")

Very interesting, those two bar tell us that even we spent a lot of incentive for employees, it’s not secure us to get actual productivity from them. But it must make sure by Machine Learning. What’s happen with targeted_productivity?

df_atp <- aggregate(targeted_productivity ~ quarter, df, FUN = mean)
df_atp <- df_atp[order(df_atp$targeted_productivity, decreasing = T),]
ggplot(df_atp, aes(x= targeted_productivity, y = reorder(quarter,targeted_productivity))) +
  geom_col() +
  labs(x= "Targeted Productivity",
       y = "Quarter") +
  geom_point()

Targeted_productivity mean’s in quarter 1 have good correlation with our target. Because in this case Quarter 1 is higher than another quarter like actual_productivity bar. In this case we have many predictors, so we will try to find out what and how predictors influence actual productivity by Regression Model.

Cleansing Data

We have 1197 row and 15 column. We have some column with date and day but we will note use it, because quarter column will replace and grouping this data by time. Department column will be perfect separator because finishing don’t have wip score/NA so we will delete it. idle_time, idle_men, no_of_style_change have majority score at 0, so will take out.

df <- df %>% 
  select(-day, -idle_time, -idle_men, -no_of_style_change,-date, -department) %>% 
  mutate(quarter = as.factor(quarter)) 
df_new <- df %>% 
  mutate_all(~ifelse(is.na(.), mean(., na.rm = TRUE), .))

In this chunk we take out all NA values, and then we check it.

colSums(is.na(df_new))
##               quarter                  team targeted_productivity 
##                     0                     0                     0 
##                   smv                   wip             over_time 
##                     0                     0                     0 
##             incentive         no_of_workers   actual_productivity 
##                     0                     0                     0
df <- na.omit(df)
anyNA(df)
## [1] FALSE

Now our data set don’t have any NA values

Check correlation score

ggcorr(df, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

We can see from this graphic that no_of_worker and over_time dont have correlation but we will keep it. incentive and targeted_productivity have high correlation, even tough smv, wip, and team have low correlation.

Cross Validation

For make sure our Machine Learning model, we start to split our data to train and test data set

set.seed(1)
index <- sample(nrow(df), 0.8*nrow(df))

df_train <- df[index,]
df_test <- df[-index,]

Check data train

anyNA(df_train)
## [1] FALSE
glimpse(df_train)
## Rows: 552
## Columns: 9
## $ quarter               <fct> Quarter2, Quarter2, Quarter4, Quarter3, Quarter5~
## $ team                  <int> 7, 11, 6, 1, 4, 6, 11, 8, 4, 5, 8, 8, 11, 9, 3, ~
## $ targeted_productivity <dbl> 0.60, 0.70, 0.75, 0.80, 0.80, 0.80, 0.70, 0.50, ~
## $ smv                   <dbl> 30.48, 27.13, 18.79, 22.52, 22.52, 30.40, 14.89,~
## $ wip                   <int> 1017, 530, 1055, 1445, 1432, 338, 1484, 1144, 83~
## $ over_time             <int> 6840, 9540, 3960, 6840, 6660, 3840, 10260, 6480,~
## $ incentive             <int> 25, 37, 45, 113, 88, 34, 50, 0, 55, 88, 0, 0, 50~
## $ no_of_workers         <dbl> 57.0, 53.0, 33.0, 57.0, 57.5, 32.0, 57.0, 54.0, ~
## $ actual_productivity   <dbl> 0.6304029, 0.5823010, 0.7506510, 1.0002304, 0.90~

Modeling

For the modeling we will use 1 predictor and more than 1 predictor. We choose the predictor with high correlation score for model with 1 predictor. But we will use stepwise to choose predictor in model with more than 1 predictor.

With 1 predictor

model_lm1 <- lm(actual_productivity~incentive, data = df)
summary(model_lm1)
## 
## Call:
## lm(formula = actual_productivity ~ incentive, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34282 -0.03081 -0.00117  0.05331  0.27960 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.521379   0.006648   78.43   <2e-16 ***
## incentive   0.004510   0.000127   35.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09207 on 689 degrees of freedom
## Multiple R-squared:  0.6467, Adjusted R-squared:  0.6461 
## F-statistic:  1261 on 1 and 689 DF,  p-value: < 2.2e-16
plot(df_train$incentive, df_train$actual_productivity)
abline(model_lm1$coefficients[1],model_lm1$coefficients[2])

Even tough we make bar plot, that’s not explain our insight clearly. And we can see in this plot by regression model, our model can make a good linear line between actual_productivity and incentive. From this plot we can say, that when we spent high incentive, we can get high actual productivity. This model have positive correlation.

But we need best model for our case, so we make another model.

Stepwise

After Make model from 1 predictor, We will try to make Machine Learning model with many predictor. In this model we try use “both” stepwise to add and delete some predictor automatically to develop the model.

model_all <- lm(actual_productivity~., data = df_train)
model_none <- lm(actual_productivity~1, data = df_train)
model_both <- step(object = model_none, 
                  scope = list(lower = model_none, upper = model_all), 
                  direction = "both")
## Start:  AIC=-2093.17
## actual_productivity ~ 1
## 
##                         Df Sum of Sq     RSS     AIC
## + incentive              1    7.7951  4.6081 -2637.7
## + targeted_productivity  1    6.0187  6.3845 -2457.7
## + smv                    1    0.3213 12.0819 -2105.7
## + team                   1    0.2706 12.1326 -2103.3
## + wip                    1    0.2152 12.1880 -2100.8
## + quarter                4    0.3138 12.0894 -2099.3
## <none>                               12.4032 -2093.2
## + over_time              1    0.0219 12.3813 -2092.1
## + no_of_workers          1    0.0024 12.4008 -2091.3
## 
## Step:  AIC=-2637.72
## actual_productivity ~ incentive
## 
##                         Df Sum of Sq     RSS     AIC
## + targeted_productivity  1    1.6446  2.9635 -2879.4
## + over_time              1    0.2029  4.4052 -2660.6
## + smv                    1    0.0984  4.5097 -2647.6
## + no_of_workers          1    0.0757  4.5324 -2644.9
## + quarter                4    0.1160  4.4921 -2643.8
## <none>                                4.6081 -2637.7
## + wip                    1    0.0006  4.6075 -2635.8
## + team                   1    0.0003  4.6078 -2635.8
## - incentive              1    7.7951 12.4032 -2093.2
## 
## Step:  AIC=-2879.4
## actual_productivity ~ incentive + targeted_productivity
## 
##                         Df Sum of Sq    RSS     AIC
## + smv                    1    0.1267 2.8367 -2901.5
## + over_time              1    0.0522 2.9113 -2887.2
## + team                   1    0.0208 2.9427 -2881.3
## <none>                               2.9635 -2879.4
## + no_of_workers          1    0.0084 2.9551 -2879.0
## + wip                    1    0.0001 2.9633 -2877.4
## + quarter                4    0.0153 2.9481 -2874.3
## - targeted_productivity  1    1.6446 4.6081 -2637.7
## - incentive              1    3.4210 6.3845 -2457.7
## 
## Step:  AIC=-2901.53
## actual_productivity ~ incentive + targeted_productivity + smv
## 
##                         Df Sum of Sq    RSS     AIC
## + team                   1    0.0944 2.7423 -2918.2
## + no_of_workers          1    0.0249 2.8118 -2904.4
## + over_time              1    0.0186 2.8181 -2903.2
## <none>                               2.8367 -2901.5
## + wip                    1    0.0003 2.8364 -2899.6
## + quarter                4    0.0123 2.8244 -2895.9
## - smv                    1    0.1267 2.9635 -2879.4
## - targeted_productivity  1    1.6729 4.5097 -2647.6
## - incentive              1    3.2653 6.1021 -2480.7
## 
## Step:  AIC=-2918.22
## actual_productivity ~ incentive + targeted_productivity + smv + 
##     team
## 
##                         Df Sum of Sq    RSS     AIC
## + over_time              1   0.02050 2.7218 -2920.4
## + no_of_workers          1   0.01627 2.7261 -2919.5
## <none>                               2.7423 -2918.2
## + wip                    1   0.00060 2.7417 -2916.3
## + quarter                4   0.01100 2.7313 -2912.4
## - team                   1   0.09441 2.8367 -2901.5
## - smv                    1   0.20039 2.9427 -2881.3
## - targeted_productivity  1   1.74508 4.4874 -2648.4
## - incentive              1   2.77856 5.5209 -2534.0
## 
## Step:  AIC=-2920.36
## actual_productivity ~ incentive + targeted_productivity + smv + 
##     team + over_time
## 
##                         Df Sum of Sq    RSS     AIC
## + no_of_workers          1   0.02430 2.6975 -2923.3
## <none>                               2.7218 -2920.4
## + wip                    1   0.00048 2.7214 -2918.4
## - over_time              1   0.02050 2.7423 -2918.2
## + quarter                4   0.01209 2.7097 -2914.8
## - team                   1   0.09625 2.8181 -2903.2
## - smv                    1   0.15916 2.8810 -2891.0
## - targeted_productivity  1   1.61471 4.3366 -2665.2
## - incentive              1   2.76302 5.4849 -2535.6
## 
## Step:  AIC=-2923.31
## actual_productivity ~ incentive + targeted_productivity + smv + 
##     team + over_time + no_of_workers
## 
##                         Df Sum of Sq    RSS     AIC
## <none>                               2.6975 -2923.3
## + wip                    1   0.00076 2.6968 -2921.5
## - no_of_workers          1   0.02430 2.7218 -2920.4
## - over_time              1   0.02853 2.7261 -2919.5
## + quarter                4   0.01073 2.6868 -2917.5
## - team                   1   0.08606 2.7836 -2908.0
## - smv                    1   0.17756 2.8751 -2890.1
## - targeted_productivity  1   1.63671 4.3342 -2663.5
## - incentive              1   2.58668 5.2842 -2554.2
summary(model_both)
## 
## Call:
## lm(formula = actual_productivity ~ incentive + targeted_productivity + 
##     smv + team + over_time + no_of_workers, data = df_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34551 -0.02554  0.00498  0.03271  0.18949 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.624e-01  3.404e-02   4.771 2.35e-06 ***
## incentive              3.062e-03  1.339e-04  22.861  < 2e-16 ***
## targeted_productivity  6.882e-01  3.785e-02  18.184  < 2e-16 ***
## smv                   -3.375e-03  5.634e-04  -5.989 3.83e-09 ***
## team                  -4.083e-03  9.793e-04  -4.170 3.55e-05 ***
## over_time             -2.689e-06  1.120e-06  -2.401   0.0167 *  
## no_of_workers          9.274e-04  4.185e-04   2.216   0.0271 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07035 on 545 degrees of freedom
## Multiple R-squared:  0.7825, Adjusted R-squared:  0.7801 
## F-statistic: 326.8 on 6 and 545 DF,  p-value: < 2.2e-16

Stepwise will help use to choose best predictor combination for our model. In this model we use Incentive, targeted_productivity, smv, and team because have high correlation with "***" and the other are over_time and no_of_workers.

Model Evaluation

After make model, we must evaluate it to know how good this model and try to optimize it.

make some predict variable to help evaluation

pred_lm1 <- predict(model_lm1, newdata = data.frame(df_test))
pred_both <- predict(model_both,newdata = data.frame(df_test))

r.squared

Watch the r.quared score for model_lm1 and adj.r.squared for model_both. We use the adj.r.squared when our model have more than 1 predictor.

summary(model_lm1)$r.squared
## [1] 0.6466551
summary(model_both)$adj.r.squared
## [1] 0.780119

Between those two model, model with stepwise and more than 1 predictor have better result with our model can describe 78% data set. It can be held because of the predictor of our target. More than 1 predictor help our machine learning to get the better formula to get the close result with the target.

Error

We use RMSE to see the error, because of that we need the range values in our target.

range(df$actual_productivity)
## [1] 0.2337055 1.1004839

RMSE 1 predictor model

sqrt(mean((pred_lm1)^2))
## [1] 0.7213513
# MAPE(pred_lm1, df_test$actual_productivity)

RMSE more than 1 predictor model

sqrt(mean((pred_both)^2))
## [1] 0.7165908
#MAPE(pred_both, df_test$actual_productivity) 

We can see from RMSE score that model with more than 1 predictor is better than another model. But our Error still high because it’s close with our target’s range. Good model have to be more smaller than our range.

Assumption Test

Linearity Check

lin_ev <- data.frame(residual = model_both$residuals, fitted = model_both$fitted.values)
lin_ev %>% 
  ggplot(aes(x= fitted,y = residual)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Our model is linear enough, that’s because we can see our line still in the abline. But we must make sure it with the correlation test.

Correlation test have simple rules, that is :

H0 = not linear (p > 0.05)

H1 = linear (p < 0.05)

ggcorr(df_train, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

For 0 correlation point on this map not we use, because 0 = dont have correlation.

cor.test(df_train$actual_productivity, df_train$wip)
## 
##  Pearson's product-moment correlation
## 
## data:  df_train$actual_productivity and df_train$wip
## t = 3.1162, df = 550, p-value = 0.001928
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04879806 0.21283147
## sample estimates:
##       cor 
## 0.1317163

p-value < 0.05, We can say between actual_productivity and wip have correlation and it’s linear.

cor.test(df_train$actual_productivity, df_train$smv)
## 
##  Pearson's product-moment correlation
## 
## data:  df_train$actual_productivity and df_train$smv
## t = -3.8244, df = 550, p-value = 0.0001461
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2411612 -0.0785461
## sample estimates:
##        cor 
## -0.1609458

p-value < 0.05, We can say between actual_productivity and smv have correlation and linear.

cor.test(df_train$actual_productivity, df_train$team)
## 
##  Pearson's product-moment correlation
## 
## data:  df_train$actual_productivity and df_train$team
## t = -3.5023, df = 550, p-value = 0.0004987
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.22834243 -0.06504959
## sample estimates:
##        cor 
## -0.1477024

p-value < 0.05, We can say between actual_productivity and team have correlation and linear. For incentive and targeted_productivity we dont need to check because their correlation score is good enough. After we check all of our predictor, we know that all our predictor that close with 0 correlation is have correlation with our terget and it’s linear.

From this Assumtion test, our predictor have linear correlation with the target.

Normality of Residual

shapiro.test(model_both$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_both$residuals
## W = 0.8618, p-value < 2.2e-16

p-value < 0.05, so the error not distribute normally. It’s not good for our model, so we must take out the outlier in some data to fix it or we can tunning the data set.

Homoscedascity

H0: model homoscedasticity

H1: model heteroscedasticity

bptest(model_both)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_both
## BP = 112.11, df = 6, p-value < 2.2e-16

p-value < 0.05, so our model heteroscedasticity. We must tuning our data’s model by take out outlier, log, or boxcox.

Little to no multicollinearity

vif(model_both)
##             incentive targeted_productivity                   smv 
##              1.498793              1.393705              1.748780 
##                  team             over_time         no_of_workers 
##              1.243627              1.166217              1.706660

VIF score from this model is good (< 10) so we can use all of them to be our predictor. From all assumption test, our model rejected by 2 assumption test. We try to fix it by tuning our model.

Tuning Model

2 Method

Take Out Extreme Outlier

  • Take Out Outlier

First we will take out from wip column.

wipout <- boxplot(df_train$wip)$out

And we will take out some outlier in over_time column.

ot_out <- boxplot(df_train$over_time)$out

Check again to know if we still have extreme outlier.

df_train_new <- df_train %>% 
  filter(!wip %in% wipout) %>% 
  select(-wip, -over_time)
boxplot(df_train_new)

Try to make one model again

# df_train_new <- df_train %>% 
#   select(actual_productivity, incentive, wip, smv, targeted_productivity, team)
model_new_all <- lm(actual_productivity~., data = df_train_new)
model_new_none <- lm(actual_productivity~1, data = df_train_new)
model_back <- stats::step(object = model_new_none,
                          scope = list(lower = model_new_none, upper = model_new_all),
                          direction = "both")
## Start:  AIC=-2042.03
## actual_productivity ~ 1
## 
##                         Df Sum of Sq     RSS     AIC
## + incentive              1    7.1786  4.4401 -2553.7
## + targeted_productivity  1    5.9534  5.6653 -2423.6
## + smv                    1    0.3032 11.3155 -2054.2
## + team                   1    0.2154 11.4033 -2050.0
## + quarter                4    0.2743 11.3444 -2046.8
## <none>                               11.6187 -2042.0
## + no_of_workers          1    0.0019 11.6168 -2040.1
## 
## Step:  AIC=-2553.71
## actual_productivity ~ incentive
## 
##                         Df Sum of Sq     RSS     AIC
## + targeted_productivity  1    1.7203  2.7198 -2813.4
## + smv                    1    0.0964  4.3437 -2563.4
## + quarter                4    0.1197  4.3204 -2560.3
## + no_of_workers          1    0.0686  4.3715 -2560.0
## <none>                                4.4401 -2553.7
## + team                   1    0.0001  4.4400 -2551.7
## - incentive              1    7.1786 11.6187 -2042.0
## 
## Step:  AIC=-2813.43
## actual_productivity ~ incentive + targeted_productivity
## 
##                         Df Sum of Sq    RSS     AIC
## + smv                    1   0.12285 2.5969 -2836.1
## + team                   1   0.01813 2.7017 -2815.0
## <none>                               2.7198 -2813.4
## + no_of_workers          1   0.00386 2.7159 -2812.2
## + quarter                4   0.01680 2.7030 -2808.7
## - targeted_productivity  1   1.72031 4.4401 -2553.7
## - incentive              1   2.94551 5.6653 -2423.6
## 
## Step:  AIC=-2836.12
## actual_productivity ~ incentive + targeted_productivity + smv
## 
##                         Df Sum of Sq    RSS     AIC
## + team                   1   0.08712 2.5098 -2852.3
## + no_of_workers          1   0.03678 2.5602 -2841.7
## <none>                               2.5969 -2836.1
## + quarter                4   0.01347 2.5835 -2830.9
## - smv                    1   0.12285 2.7198 -2813.4
## - targeted_productivity  1   1.74680 4.3437 -2563.4
## - incentive              1   2.80866 5.4056 -2446.6
## 
## Step:  AIC=-2852.34
## actual_productivity ~ incentive + targeted_productivity + smv + 
##     team
## 
##                         Df Sum of Sq    RSS     AIC
## + no_of_workers          1   0.02635 2.4835 -2856.0
## <none>                               2.5098 -2852.3
## + quarter                4   0.01210 2.4977 -2846.9
## - team                   1   0.08712 2.5969 -2836.1
## - smv                    1   0.19184 2.7017 -2815.0
## - targeted_productivity  1   1.81457 4.3244 -2563.8
## - incentive              1   2.40717 4.9170 -2495.2
## 
## Step:  AIC=-2855.97
## actual_productivity ~ incentive + targeted_productivity + smv + 
##     team + no_of_workers
## 
##                         Df Sum of Sq    RSS     AIC
## <none>                               2.4835 -2856.0
## - no_of_workers          1   0.02635 2.5098 -2852.3
## + quarter                4   0.01067 2.4728 -2850.3
## - team                   1   0.07669 2.5602 -2841.7
## - smv                    1   0.20610 2.6896 -2815.4
## - targeted_productivity  1   1.82812 4.3116 -2563.4
## - incentive              1   2.18541 4.6689 -2520.9
summary(model_back)
## 
## Call:
## lm(formula = actual_productivity ~ incentive + targeted_productivity + 
##     smv + team + no_of_workers, data = df_train_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34022 -0.02333  0.00487  0.03264  0.18739 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.1301318  0.0333051   3.907 0.000105 ***
## incentive              0.0028577  0.0001326  21.555  < 2e-16 ***
## targeted_productivity  0.7293125  0.0369935  19.715  < 2e-16 ***
## smv                   -0.0036533  0.0005519  -6.620 8.87e-11 ***
## team                  -0.0039069  0.0009676  -4.038 6.19e-05 ***
## no_of_workers          0.0009631  0.0004069   2.367 0.018306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06858 on 528 degrees of freedom
## Multiple R-squared:  0.7863, Adjusted R-squared:  0.7842 
## F-statistic: 388.4 on 5 and 528 DF,  p-value: < 2.2e-16

The adj.r.squared have better score but not significant from our first model.

pred_new <- predict(model_new_all,newdata = data.frame(df_test))
sqrt(mean((pred_new)^2))
## [1] 0.7203989
#MAPE(pred_new, df_test$actual_productivity) 

Our new model “model_back” error is bigger than our last model “pred_both”. Based on RMSE, our “pred_both” model is better than our new models even we take out the outlier. So we will tery use another method for tuning our data.

  • Test Assumption

Linearity Check

lin_ev <- data.frame(residual = model_back$residuals, fitted = model_back$fitted.values)
lin_ev %>% 
  ggplot(aes(x= fitted,y = residual)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Normality of Residuals

shapiro.test(model_back$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_back$residuals
## W = 0.8516, p-value < 2.2e-16

Homoscedacity

bptest(model_back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_back
## BP = 96.342, df = 5, p-value < 2.2e-16

Little to no multicollinearity

vif(model_back)
##             incentive targeted_productivity                   smv 
##              1.468937              1.377447              1.730808 
##                  team         no_of_workers 
##              1.239884              1.660926

For overall this new model is not significant different with the our first model result and still reject from 2 assumption test. So we try to tuning our data set again because we want to make better model. Our model not accepted by homoscedacity check and Residuals not normally distributed.

Use Boxcox Function

  • Boxcox

Define the boxcox number, first we make a plot use boxcox

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
boxcox(model_new_none, plotit = TRUE, lambda = seq(1.6, 1.9, by = 0.1))

We will use 1.8 because it is close intersection line.

model_new <- lm((((actual_productivity^1.8)-1)/1.8)~., data = df_train_new)
model_sn <- lm(model_new, direction = "backward")
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'direction' will be disregarded
summary(model_sn)
## 
## Call:
## lm(formula = model_new, direction = "backward")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.205367 -0.019987  0.000595  0.020097  0.167459 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -6.406e-01  2.219e-02 -28.867  < 2e-16 ***
## quarterQuarter2       -1.290e-03  5.110e-03  -0.253  0.80073    
## quarterQuarter3       -6.606e-03  5.837e-03  -1.132  0.25825    
## quarterQuarter4       -7.645e-03  5.510e-03  -1.387  0.16589    
## quarterQuarter5       -9.116e-03  1.030e-02  -0.885  0.37648    
## team                  -3.048e-03  6.277e-04  -4.857 1.58e-06 ***
## targeted_productivity  4.800e-01  2.438e-02  19.686  < 2e-16 ***
## smv                   -2.502e-03  3.587e-04  -6.977 9.17e-12 ***
## incentive              2.275e-03  8.799e-05  25.854  < 2e-16 ***
## no_of_workers          7.510e-04  2.642e-04   2.843  0.00464 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04446 on 524 degrees of freedom
## Multiple R-squared:  0.8275, Adjusted R-squared:  0.8245 
## F-statistic: 279.3 on 9 and 524 DF,  p-value: < 2.2e-16
pred_sn <- predict(model_sn,newdata = data.frame(df_test))

RMSE

sqrt(mean((pred_sn)^2))
## [1] 0.2744685

After we use boxcox method for tuning, we get better adj.r.squared score than all of other model and the RMSE score low than the target’s range. Then we must chcek the assummption test.

  • Assumption Test

Linearity Check

lin_ev <- data.frame(residual = model_new$residuals, fitted = model_new$fitted.values)
lin_ev %>% 
  ggplot(aes(x= fitted,y = residual)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Normality of Residuals

shapiro.test(model_new$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_new$residuals
## W = 0.92702, p-value = 1.927e-15

Homoscedacity

bptest(model_new)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_new
## BP = 88.888, df = 9, p-value = 2.72e-15

Little to no multicollinearity

vif(model_new)
##                           GVIF Df GVIF^(1/(2*Df))
## quarter               1.076791  4        1.009291
## team                  1.241652  1        1.114295
## targeted_productivity 1.424020  1        1.193323
## smv                   1.739834  1        1.319028
## incentive             1.539679  1        1.240838
## no_of_workers         1.665534  1        1.290556

Tuning model by boxcox using data train from “takeout outlier” tuning model have better adj.r.squared and RMSE than other models, but 2 assumption test still reject our model.

Conclusion

In this case our model not fit with the data set, even tough we have good score for the model. RMSE score for our error not good because it is large in our range, and our model reject from 2 assumption test. For the future this data set maybe can use another Machine Learning Method or another tuning data set method.