For Garment Industry, employes productivity will influence their income. That’s why HR Departement want to optimize the productivity. For profesional employes, internal factor will not interupt their job. So as an analyst we want to help our HR Departement to know what factor can influence their productivity and make some Machine Learning model to optimize employes productivity.
This dataset from Kaggle (https://www.kaggle.com/ishadss/productivity-prediction-of-garment-employees). In this data set we will find 15 column :
date : Date in MM-DD-YYYY
day : Day of the Week
quarter : A portion of the month. A month was divided into four quarters
department : Associated department with the instance
teamno : Associated team number with the instance
noofworkers : Number of workers in each team
noofstylechange : Number of changes in the style of a particular product
targetedproductivity : Targeted productivity set by the Authority for each team for each day.
smv : Standard Minute Value, it is the allocated time for a task
wip : Work in progress. Includes the number of unfinished items for products 11 overtime : Represents the amount of overtime by each team in minutes
incentive : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.
idletime : The amount of time when the production was interrupted due to several reasons
idlemen : The number of workers who were idle due to production interruption
actual_productivity : The actual % of productivity that was delivered by the workers. It ranges from 0-1.
First start with prepare our tools to make ML
library(tidyverse)
library(ggplot2)
library(dplyr)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(gsubfn)Read the data
df <- read.csv("garments_worker_productivity.csv")
glimpse(df)## Rows: 1,197
## Columns: 15
## $ date <chr> "1/1/2015", "1/1/2015", "1/1/2015", "1/1/2015", ~
## $ quarter <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", ~
## $ department <chr> "sweing", "finishing ", "sweing", "sweing", "swe~
## $ day <chr> "Thursday", "Thursday", "Thursday", "Thursday", ~
## $ team <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,~
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, ~
## $ smv <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2~
## $ wip <int> 1108, NA, 968, 968, 1170, 984, NA, 795, 733, 681~
## $ over_time <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60~
## $ incentive <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50~
## $ idle_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ idle_men <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_style_change <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_workers <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55~
## $ actual_productivity <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80~
Date as quarter group : quarter 1 ~ 1-7 quarter 2 ~ 8-14 quarter 3 ~ 15-21 quarter 4 ~ 22-28 quarter 5 ~ 29-31
Before we clean the data, we will try find out some insight.
we have incentive column, how much we spent the money for employees incentive each quarter?
df_iq <- aggregate(incentive ~ quarter, df, FUN = sum)
df_iq <- df_iq[order(df_iq$incentive, decreasing = T),]
ggplot(df_iq, aes(x= incentive, y = reorder(quarter, incentive))) +
geom_col() +
labs(x= "Incentive",
y = "Quarter")In quarter 2, we spent so much money than another quarter. At the same time, that we have actual productivity in that periods?
df_aq <- aggregate(actual_productivity ~ quarter, df, FUN = sum)
df_aq <- df_aq[order(df_aq$actual_productivity, decreasing = T),]
ggplot(df_aq, aes(x= actual_productivity, y = reorder(quarter, actual_productivity))) +
geom_col() +
labs(x= "Incentive",
y = "Quarter")Very interesting, those two bar tell us that even we spent a lot of incentive for employees, it’s not secure us to get actual productivity from them. But it must make sure by Machine Learning. What’s happen with targeted_productivity?
df_atp <- aggregate(targeted_productivity ~ quarter, df, FUN = mean)
df_atp <- df_atp[order(df_atp$targeted_productivity, decreasing = T),]
ggplot(df_atp, aes(x= targeted_productivity, y = reorder(quarter,targeted_productivity))) +
geom_col() +
labs(x= "Targeted Productivity",
y = "Quarter") +
geom_point() Targeted_productivity mean’s in quarter 1 have good correlation with our target. Because in this case Quarter 1 is higher than another quarter like actual_productivity bar. In this case we have many predictors, so we will try to find out what and how predictors influence actual productivity by Regression Model.
We have 1197 row and 15 column. We have some column with date and day but we will note use it, because quarter column will replace and grouping this data by time. Department column will be perfect separator because finishing don’t have wip score/NA so we will delete it. idle_time, idle_men, no_of_style_change have majority score at 0, so will take out.
df <- df %>%
select(-day, -idle_time, -idle_men, -no_of_style_change,-date, -department) %>%
mutate(quarter = as.factor(quarter))
df_new <- df %>%
mutate_all(~ifelse(is.na(.), mean(., na.rm = TRUE), .))In this chunk we take out all NA values, and then we check it.
colSums(is.na(df_new))## quarter team targeted_productivity
## 0 0 0
## smv wip over_time
## 0 0 0
## incentive no_of_workers actual_productivity
## 0 0 0
df <- na.omit(df)
anyNA(df)## [1] FALSE
Now our data set don’t have any NA values
ggcorr(df, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)We can see from this graphic that no_of_worker and over_time dont have correlation but we will keep it. incentive and targeted_productivity have high correlation, even tough smv, wip, and team have low correlation.
For make sure our Machine Learning model, we start to split our data to train and test data set
set.seed(1)
index <- sample(nrow(df), 0.8*nrow(df))
df_train <- df[index,]
df_test <- df[-index,]Check data train
anyNA(df_train)## [1] FALSE
glimpse(df_train)## Rows: 552
## Columns: 9
## $ quarter <fct> Quarter2, Quarter2, Quarter4, Quarter3, Quarter5~
## $ team <int> 7, 11, 6, 1, 4, 6, 11, 8, 4, 5, 8, 8, 11, 9, 3, ~
## $ targeted_productivity <dbl> 0.60, 0.70, 0.75, 0.80, 0.80, 0.80, 0.70, 0.50, ~
## $ smv <dbl> 30.48, 27.13, 18.79, 22.52, 22.52, 30.40, 14.89,~
## $ wip <int> 1017, 530, 1055, 1445, 1432, 338, 1484, 1144, 83~
## $ over_time <int> 6840, 9540, 3960, 6840, 6660, 3840, 10260, 6480,~
## $ incentive <int> 25, 37, 45, 113, 88, 34, 50, 0, 55, 88, 0, 0, 50~
## $ no_of_workers <dbl> 57.0, 53.0, 33.0, 57.0, 57.5, 32.0, 57.0, 54.0, ~
## $ actual_productivity <dbl> 0.6304029, 0.5823010, 0.7506510, 1.0002304, 0.90~
For the modeling we will use 1 predictor and more than 1 predictor. We choose the predictor with high correlation score for model with 1 predictor. But we will use stepwise to choose predictor in model with more than 1 predictor.
model_lm1 <- lm(actual_productivity~incentive, data = df)
summary(model_lm1)##
## Call:
## lm(formula = actual_productivity ~ incentive, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34282 -0.03081 -0.00117 0.05331 0.27960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.521379 0.006648 78.43 <2e-16 ***
## incentive 0.004510 0.000127 35.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09207 on 689 degrees of freedom
## Multiple R-squared: 0.6467, Adjusted R-squared: 0.6461
## F-statistic: 1261 on 1 and 689 DF, p-value: < 2.2e-16
plot(df_train$incentive, df_train$actual_productivity)
abline(model_lm1$coefficients[1],model_lm1$coefficients[2]) Even tough we make bar plot, that’s not explain our insight clearly. And we can see in this plot by regression model, our model can make a good linear line between actual_productivity and incentive. From this plot we can say, that when we spent high incentive, we can get high actual productivity. This model have positive correlation.
But we need best model for our case, so we make another model.
After Make model from 1 predictor, We will try to make Machine Learning model with many predictor. In this model we try use “both” stepwise to add and delete some predictor automatically to develop the model.
model_all <- lm(actual_productivity~., data = df_train)
model_none <- lm(actual_productivity~1, data = df_train)
model_both <- step(object = model_none,
scope = list(lower = model_none, upper = model_all),
direction = "both")## Start: AIC=-2093.17
## actual_productivity ~ 1
##
## Df Sum of Sq RSS AIC
## + incentive 1 7.7951 4.6081 -2637.7
## + targeted_productivity 1 6.0187 6.3845 -2457.7
## + smv 1 0.3213 12.0819 -2105.7
## + team 1 0.2706 12.1326 -2103.3
## + wip 1 0.2152 12.1880 -2100.8
## + quarter 4 0.3138 12.0894 -2099.3
## <none> 12.4032 -2093.2
## + over_time 1 0.0219 12.3813 -2092.1
## + no_of_workers 1 0.0024 12.4008 -2091.3
##
## Step: AIC=-2637.72
## actual_productivity ~ incentive
##
## Df Sum of Sq RSS AIC
## + targeted_productivity 1 1.6446 2.9635 -2879.4
## + over_time 1 0.2029 4.4052 -2660.6
## + smv 1 0.0984 4.5097 -2647.6
## + no_of_workers 1 0.0757 4.5324 -2644.9
## + quarter 4 0.1160 4.4921 -2643.8
## <none> 4.6081 -2637.7
## + wip 1 0.0006 4.6075 -2635.8
## + team 1 0.0003 4.6078 -2635.8
## - incentive 1 7.7951 12.4032 -2093.2
##
## Step: AIC=-2879.4
## actual_productivity ~ incentive + targeted_productivity
##
## Df Sum of Sq RSS AIC
## + smv 1 0.1267 2.8367 -2901.5
## + over_time 1 0.0522 2.9113 -2887.2
## + team 1 0.0208 2.9427 -2881.3
## <none> 2.9635 -2879.4
## + no_of_workers 1 0.0084 2.9551 -2879.0
## + wip 1 0.0001 2.9633 -2877.4
## + quarter 4 0.0153 2.9481 -2874.3
## - targeted_productivity 1 1.6446 4.6081 -2637.7
## - incentive 1 3.4210 6.3845 -2457.7
##
## Step: AIC=-2901.53
## actual_productivity ~ incentive + targeted_productivity + smv
##
## Df Sum of Sq RSS AIC
## + team 1 0.0944 2.7423 -2918.2
## + no_of_workers 1 0.0249 2.8118 -2904.4
## + over_time 1 0.0186 2.8181 -2903.2
## <none> 2.8367 -2901.5
## + wip 1 0.0003 2.8364 -2899.6
## + quarter 4 0.0123 2.8244 -2895.9
## - smv 1 0.1267 2.9635 -2879.4
## - targeted_productivity 1 1.6729 4.5097 -2647.6
## - incentive 1 3.2653 6.1021 -2480.7
##
## Step: AIC=-2918.22
## actual_productivity ~ incentive + targeted_productivity + smv +
## team
##
## Df Sum of Sq RSS AIC
## + over_time 1 0.02050 2.7218 -2920.4
## + no_of_workers 1 0.01627 2.7261 -2919.5
## <none> 2.7423 -2918.2
## + wip 1 0.00060 2.7417 -2916.3
## + quarter 4 0.01100 2.7313 -2912.4
## - team 1 0.09441 2.8367 -2901.5
## - smv 1 0.20039 2.9427 -2881.3
## - targeted_productivity 1 1.74508 4.4874 -2648.4
## - incentive 1 2.77856 5.5209 -2534.0
##
## Step: AIC=-2920.36
## actual_productivity ~ incentive + targeted_productivity + smv +
## team + over_time
##
## Df Sum of Sq RSS AIC
## + no_of_workers 1 0.02430 2.6975 -2923.3
## <none> 2.7218 -2920.4
## + wip 1 0.00048 2.7214 -2918.4
## - over_time 1 0.02050 2.7423 -2918.2
## + quarter 4 0.01209 2.7097 -2914.8
## - team 1 0.09625 2.8181 -2903.2
## - smv 1 0.15916 2.8810 -2891.0
## - targeted_productivity 1 1.61471 4.3366 -2665.2
## - incentive 1 2.76302 5.4849 -2535.6
##
## Step: AIC=-2923.31
## actual_productivity ~ incentive + targeted_productivity + smv +
## team + over_time + no_of_workers
##
## Df Sum of Sq RSS AIC
## <none> 2.6975 -2923.3
## + wip 1 0.00076 2.6968 -2921.5
## - no_of_workers 1 0.02430 2.7218 -2920.4
## - over_time 1 0.02853 2.7261 -2919.5
## + quarter 4 0.01073 2.6868 -2917.5
## - team 1 0.08606 2.7836 -2908.0
## - smv 1 0.17756 2.8751 -2890.1
## - targeted_productivity 1 1.63671 4.3342 -2663.5
## - incentive 1 2.58668 5.2842 -2554.2
summary(model_both)##
## Call:
## lm(formula = actual_productivity ~ incentive + targeted_productivity +
## smv + team + over_time + no_of_workers, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34551 -0.02554 0.00498 0.03271 0.18949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.624e-01 3.404e-02 4.771 2.35e-06 ***
## incentive 3.062e-03 1.339e-04 22.861 < 2e-16 ***
## targeted_productivity 6.882e-01 3.785e-02 18.184 < 2e-16 ***
## smv -3.375e-03 5.634e-04 -5.989 3.83e-09 ***
## team -4.083e-03 9.793e-04 -4.170 3.55e-05 ***
## over_time -2.689e-06 1.120e-06 -2.401 0.0167 *
## no_of_workers 9.274e-04 4.185e-04 2.216 0.0271 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07035 on 545 degrees of freedom
## Multiple R-squared: 0.7825, Adjusted R-squared: 0.7801
## F-statistic: 326.8 on 6 and 545 DF, p-value: < 2.2e-16
Stepwise will help use to choose best predictor combination for our model. In this model we use Incentive, targeted_productivity, smv, and team because have high correlation with "***" and the other are over_time and no_of_workers.
After make model, we must evaluate it to know how good this model and try to optimize it.
make some predict variable to help evaluation
pred_lm1 <- predict(model_lm1, newdata = data.frame(df_test))
pred_both <- predict(model_both,newdata = data.frame(df_test))Watch the r.quared score for model_lm1 and adj.r.squared for model_both. We use the adj.r.squared when our model have more than 1 predictor.
summary(model_lm1)$r.squared## [1] 0.6466551
summary(model_both)$adj.r.squared## [1] 0.780119
Between those two model, model with stepwise and more than 1 predictor have better result with our model can describe 78% data set. It can be held because of the predictor of our target. More than 1 predictor help our machine learning to get the better formula to get the close result with the target.
We use RMSE to see the error, because of that we need the range values in our target.
range(df$actual_productivity)## [1] 0.2337055 1.1004839
RMSE 1 predictor model
sqrt(mean((pred_lm1)^2))## [1] 0.7213513
# MAPE(pred_lm1, df_test$actual_productivity)RMSE more than 1 predictor model
sqrt(mean((pred_both)^2))## [1] 0.7165908
#MAPE(pred_both, df_test$actual_productivity) We can see from RMSE score that model with more than 1 predictor is better than another model. But our Error still high because it’s close with our target’s range. Good model have to be more smaller than our range.
lin_ev <- data.frame(residual = model_both$residuals, fitted = model_both$fitted.values)
lin_ev %>%
ggplot(aes(x= fitted,y = residual)) +
geom_point() +
geom_smooth() +
geom_hline(aes(yintercept = 0))## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Our model is linear enough, that’s because we can see our line still in the abline. But we must make sure it with the correlation test.
Correlation test have simple rules, that is :
H0 = not linear (p > 0.05)
H1 = linear (p < 0.05)
ggcorr(df_train, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)For 0 correlation point on this map not we use, because 0 = dont have correlation.
cor.test(df_train$actual_productivity, df_train$wip)##
## Pearson's product-moment correlation
##
## data: df_train$actual_productivity and df_train$wip
## t = 3.1162, df = 550, p-value = 0.001928
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04879806 0.21283147
## sample estimates:
## cor
## 0.1317163
p-value < 0.05, We can say between actual_productivity and wip have correlation and it’s linear.
cor.test(df_train$actual_productivity, df_train$smv)##
## Pearson's product-moment correlation
##
## data: df_train$actual_productivity and df_train$smv
## t = -3.8244, df = 550, p-value = 0.0001461
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2411612 -0.0785461
## sample estimates:
## cor
## -0.1609458
p-value < 0.05, We can say between actual_productivity and smv have correlation and linear.
cor.test(df_train$actual_productivity, df_train$team)##
## Pearson's product-moment correlation
##
## data: df_train$actual_productivity and df_train$team
## t = -3.5023, df = 550, p-value = 0.0004987
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.22834243 -0.06504959
## sample estimates:
## cor
## -0.1477024
p-value < 0.05, We can say between actual_productivity and team have correlation and linear. For incentive and targeted_productivity we dont need to check because their correlation score is good enough. After we check all of our predictor, we know that all our predictor that close with 0 correlation is have correlation with our terget and it’s linear.
From this Assumtion test, our predictor have linear correlation with the target.
shapiro.test(model_both$residuals)##
## Shapiro-Wilk normality test
##
## data: model_both$residuals
## W = 0.8618, p-value < 2.2e-16
p-value < 0.05, so the error not distribute normally. It’s not good for our model, so we must take out the outlier in some data to fix it or we can tunning the data set.
H0: model homoscedasticity
H1: model heteroscedasticity
bptest(model_both)##
## studentized Breusch-Pagan test
##
## data: model_both
## BP = 112.11, df = 6, p-value < 2.2e-16
p-value < 0.05, so our model heteroscedasticity. We must tuning our data’s model by take out outlier, log, or boxcox.
vif(model_both)## incentive targeted_productivity smv
## 1.498793 1.393705 1.748780
## team over_time no_of_workers
## 1.243627 1.166217 1.706660
VIF score from this model is good (< 10) so we can use all of them to be our predictor. From all assumption test, our model rejected by 2 assumption test. We try to fix it by tuning our model.
First we will take out from wip column.
wipout <- boxplot(df_train$wip)$outAnd we will take out some outlier in over_time column.
ot_out <- boxplot(df_train$over_time)$outCheck again to know if we still have extreme outlier.
df_train_new <- df_train %>%
filter(!wip %in% wipout) %>%
select(-wip, -over_time)
boxplot(df_train_new)Try to make one model again
# df_train_new <- df_train %>%
# select(actual_productivity, incentive, wip, smv, targeted_productivity, team)
model_new_all <- lm(actual_productivity~., data = df_train_new)
model_new_none <- lm(actual_productivity~1, data = df_train_new)
model_back <- stats::step(object = model_new_none,
scope = list(lower = model_new_none, upper = model_new_all),
direction = "both")## Start: AIC=-2042.03
## actual_productivity ~ 1
##
## Df Sum of Sq RSS AIC
## + incentive 1 7.1786 4.4401 -2553.7
## + targeted_productivity 1 5.9534 5.6653 -2423.6
## + smv 1 0.3032 11.3155 -2054.2
## + team 1 0.2154 11.4033 -2050.0
## + quarter 4 0.2743 11.3444 -2046.8
## <none> 11.6187 -2042.0
## + no_of_workers 1 0.0019 11.6168 -2040.1
##
## Step: AIC=-2553.71
## actual_productivity ~ incentive
##
## Df Sum of Sq RSS AIC
## + targeted_productivity 1 1.7203 2.7198 -2813.4
## + smv 1 0.0964 4.3437 -2563.4
## + quarter 4 0.1197 4.3204 -2560.3
## + no_of_workers 1 0.0686 4.3715 -2560.0
## <none> 4.4401 -2553.7
## + team 1 0.0001 4.4400 -2551.7
## - incentive 1 7.1786 11.6187 -2042.0
##
## Step: AIC=-2813.43
## actual_productivity ~ incentive + targeted_productivity
##
## Df Sum of Sq RSS AIC
## + smv 1 0.12285 2.5969 -2836.1
## + team 1 0.01813 2.7017 -2815.0
## <none> 2.7198 -2813.4
## + no_of_workers 1 0.00386 2.7159 -2812.2
## + quarter 4 0.01680 2.7030 -2808.7
## - targeted_productivity 1 1.72031 4.4401 -2553.7
## - incentive 1 2.94551 5.6653 -2423.6
##
## Step: AIC=-2836.12
## actual_productivity ~ incentive + targeted_productivity + smv
##
## Df Sum of Sq RSS AIC
## + team 1 0.08712 2.5098 -2852.3
## + no_of_workers 1 0.03678 2.5602 -2841.7
## <none> 2.5969 -2836.1
## + quarter 4 0.01347 2.5835 -2830.9
## - smv 1 0.12285 2.7198 -2813.4
## - targeted_productivity 1 1.74680 4.3437 -2563.4
## - incentive 1 2.80866 5.4056 -2446.6
##
## Step: AIC=-2852.34
## actual_productivity ~ incentive + targeted_productivity + smv +
## team
##
## Df Sum of Sq RSS AIC
## + no_of_workers 1 0.02635 2.4835 -2856.0
## <none> 2.5098 -2852.3
## + quarter 4 0.01210 2.4977 -2846.9
## - team 1 0.08712 2.5969 -2836.1
## - smv 1 0.19184 2.7017 -2815.0
## - targeted_productivity 1 1.81457 4.3244 -2563.8
## - incentive 1 2.40717 4.9170 -2495.2
##
## Step: AIC=-2855.97
## actual_productivity ~ incentive + targeted_productivity + smv +
## team + no_of_workers
##
## Df Sum of Sq RSS AIC
## <none> 2.4835 -2856.0
## - no_of_workers 1 0.02635 2.5098 -2852.3
## + quarter 4 0.01067 2.4728 -2850.3
## - team 1 0.07669 2.5602 -2841.7
## - smv 1 0.20610 2.6896 -2815.4
## - targeted_productivity 1 1.82812 4.3116 -2563.4
## - incentive 1 2.18541 4.6689 -2520.9
summary(model_back)##
## Call:
## lm(formula = actual_productivity ~ incentive + targeted_productivity +
## smv + team + no_of_workers, data = df_train_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34022 -0.02333 0.00487 0.03264 0.18739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1301318 0.0333051 3.907 0.000105 ***
## incentive 0.0028577 0.0001326 21.555 < 2e-16 ***
## targeted_productivity 0.7293125 0.0369935 19.715 < 2e-16 ***
## smv -0.0036533 0.0005519 -6.620 8.87e-11 ***
## team -0.0039069 0.0009676 -4.038 6.19e-05 ***
## no_of_workers 0.0009631 0.0004069 2.367 0.018306 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06858 on 528 degrees of freedom
## Multiple R-squared: 0.7863, Adjusted R-squared: 0.7842
## F-statistic: 388.4 on 5 and 528 DF, p-value: < 2.2e-16
The adj.r.squared have better score but not significant from our first model.
pred_new <- predict(model_new_all,newdata = data.frame(df_test))sqrt(mean((pred_new)^2))## [1] 0.7203989
#MAPE(pred_new, df_test$actual_productivity) Our new model “model_back” error is bigger than our last model “pred_both”. Based on RMSE, our “pred_both” model is better than our new models even we take out the outlier. So we will tery use another method for tuning our data.
Linearity Check
lin_ev <- data.frame(residual = model_back$residuals, fitted = model_back$fitted.values)
lin_ev %>%
ggplot(aes(x= fitted,y = residual)) +
geom_point() +
geom_smooth() +
geom_hline(aes(yintercept = 0))## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Normality of Residuals
shapiro.test(model_back$residuals)##
## Shapiro-Wilk normality test
##
## data: model_back$residuals
## W = 0.8516, p-value < 2.2e-16
Homoscedacity
bptest(model_back)##
## studentized Breusch-Pagan test
##
## data: model_back
## BP = 96.342, df = 5, p-value < 2.2e-16
Little to no multicollinearity
vif(model_back)## incentive targeted_productivity smv
## 1.468937 1.377447 1.730808
## team no_of_workers
## 1.239884 1.660926
For overall this new model is not significant different with the our first model result and still reject from 2 assumption test. So we try to tuning our data set again because we want to make better model. Our model not accepted by homoscedacity check and Residuals not normally distributed.
Define the boxcox number, first we make a plot use boxcox
library(MASS)##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
boxcox(model_new_none, plotit = TRUE, lambda = seq(1.6, 1.9, by = 0.1))We will use 1.8 because it is close intersection line.
model_new <- lm((((actual_productivity^1.8)-1)/1.8)~., data = df_train_new)
model_sn <- lm(model_new, direction = "backward")## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## extra argument 'direction' will be disregarded
summary(model_sn)##
## Call:
## lm(formula = model_new, direction = "backward")
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.205367 -0.019987 0.000595 0.020097 0.167459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.406e-01 2.219e-02 -28.867 < 2e-16 ***
## quarterQuarter2 -1.290e-03 5.110e-03 -0.253 0.80073
## quarterQuarter3 -6.606e-03 5.837e-03 -1.132 0.25825
## quarterQuarter4 -7.645e-03 5.510e-03 -1.387 0.16589
## quarterQuarter5 -9.116e-03 1.030e-02 -0.885 0.37648
## team -3.048e-03 6.277e-04 -4.857 1.58e-06 ***
## targeted_productivity 4.800e-01 2.438e-02 19.686 < 2e-16 ***
## smv -2.502e-03 3.587e-04 -6.977 9.17e-12 ***
## incentive 2.275e-03 8.799e-05 25.854 < 2e-16 ***
## no_of_workers 7.510e-04 2.642e-04 2.843 0.00464 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04446 on 524 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8245
## F-statistic: 279.3 on 9 and 524 DF, p-value: < 2.2e-16
pred_sn <- predict(model_sn,newdata = data.frame(df_test))RMSE
sqrt(mean((pred_sn)^2))## [1] 0.2744685
After we use boxcox method for tuning, we get better adj.r.squared score than all of other model and the RMSE score low than the target’s range. Then we must chcek the assummption test.
Linearity Check
lin_ev <- data.frame(residual = model_new$residuals, fitted = model_new$fitted.values)
lin_ev %>%
ggplot(aes(x= fitted,y = residual)) +
geom_point() +
geom_smooth() +
geom_hline(aes(yintercept = 0))## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Normality of Residuals
shapiro.test(model_new$residuals)##
## Shapiro-Wilk normality test
##
## data: model_new$residuals
## W = 0.92702, p-value = 1.927e-15
Homoscedacity
bptest(model_new)##
## studentized Breusch-Pagan test
##
## data: model_new
## BP = 88.888, df = 9, p-value = 2.72e-15
Little to no multicollinearity
vif(model_new)## GVIF Df GVIF^(1/(2*Df))
## quarter 1.076791 4 1.009291
## team 1.241652 1 1.114295
## targeted_productivity 1.424020 1 1.193323
## smv 1.739834 1 1.319028
## incentive 1.539679 1 1.240838
## no_of_workers 1.665534 1 1.290556
Tuning model by boxcox using data train from “takeout outlier” tuning model have better adj.r.squared and RMSE than other models, but 2 assumption test still reject our model.
In this case our model not fit with the data set, even tough we have good score for the model. RMSE score for our error not good because it is large in our range, and our model reject from 2 assumption test. For the future this data set maybe can use another Machine Learning Method or another tuning data set method.