Introduction
This dataset is titled “Productivity Prediction of Garment Employees”
and shows various explanatory variables and one target/response variable
(actual_productivity). The purpose of collecting this dataset is to see
what factors may cause employees productivity levels to be high or low.
This dataset was collected from Kaggle. This dataset contains 1197
observations and 15 variables (14 feature variables and 1 target
variable).
The date (categorical) shows the date in MM-DD-YYYY format. However,
the date is more of an identifier than a predictor. The quarter
(categorical) shows the portion of the month (divided into 4 quarters).
The department (categorical) shows the department (sewing/finishing).
The day (categorical) shows the day of the week. The team (categorical
represented as numeric) shows the team number. The targeted_productivity
(numerical) shows the targeted productivity levels. The smv (numerical)
shows the Standard Minute Value (allocated time for a task). The wip
(numerical) shows the work in progress (Includes the number of
unfinished items for products). The over_time (numerical) shows the
amount of overtime by each team in minutes. The incentive (numerical)
shows the amount of financial incentive in BDT. The idle_time
(numerical) shows the amount of time when the production was
interrupted. The idle_men (numerical) shows the number of workers who
were idle due to production interruption. The no_of_style_change
(numerical) shows the number of changes in the style of a particular
product. The no_of_workers (numerical) shows the number of of workers in
each team. The target/response variable (actual_productivity) shows the
actual productivity level (between 0-1).
The original dataset has 506 missing values in the wip column. This
can be resolved by MICE imputation.
Based on this dataset, we can formulate a research question. The
question we can form is what variables are the most important in
predicting productivity level of employees. For regression, we can use
actual_productivity (continuous) as our response variable. For
classification, we can create a binary response variable with 1
(productive) and 0 (not productive). We can use the condition if
actual_productivity is greater than the targeted_productivity.
We will use Regularized Linear/logistic Regression and SVM/SVR
(Linear/RBF kernels) to try to answer the research question.
Read in Dataset
Read in dataset.
garment <- read.csv("garments_worker_productivity.csv")
See dataset structure.
str(garment)
'data.frame': 1197 obs. of 15 variables:
$ date : chr "1/1/2015" "1/1/2015" "1/1/2015" "1/1/2015" ...
$ quarter : chr "Quarter1" "Quarter1" "Quarter1" "Quarter1" ...
$ department : chr "sweing" "finishing " "sweing" "sweing" ...
$ day : chr "Thursday" "Thursday" "Thursday" "Thursday" ...
$ team : int 8 1 11 12 6 7 2 3 2 1 ...
$ targeted_productivity: num 0.8 0.75 0.8 0.8 0.8 0.8 0.75 0.75 0.75 0.75 ...
$ smv : num 26.16 3.94 11.41 11.41 25.9 ...
$ wip : int 1108 NA 968 968 1170 984 NA 795 733 681 ...
$ over_time : int 7080 960 3660 3660 1920 6720 960 6900 6000 6900 ...
$ incentive : int 98 0 50 50 50 38 0 45 34 45 ...
$ idle_time : num 0 0 0 0 0 0 0 0 0 0 ...
$ idle_men : int 0 0 0 0 0 0 0 0 0 0 ...
$ no_of_style_change : int 0 0 0 0 0 0 0 0 0 0 ...
$ no_of_workers : num 59 8 30.5 30.5 56 56 8 57.5 55 57.5 ...
$ actual_productivity : num 0.941 0.886 0.801 0.801 0.8 ...
Check for Missing
Values
506 missing values in the wip column.
colSums(is.na(garment))
date quarter department
0 0 0
day team targeted_productivity
0 0 0
smv wip over_time
0 506 0
incentive idle_time idle_men
0 0 0
no_of_style_change no_of_workers actual_productivity
0 0 0
EDA
This section will show the distribution of each individual
feature.
The below figure shows the distribution of the quarter variable.
ggplot(garment, aes(x = quarter)) +
geom_bar() +
labs(title = "quarter")

The below figure shows the distribution of the department variable.
Data entry errors in this column were resolved.
garment$department[garment$department == 'finishing '] <- 'finishing'
ggplot(garment, aes(x = department)) +
geom_bar() +
labs(title = "department")

The below figure shows the distribution of the day variable.
ggplot(garment, aes(x = day)) +
geom_bar() +
labs(title = "day")

The below figure shows the distribution of the team variable.
# ggplot(data = garment, aes(x = team)) +
# geom_boxplot() +
#
# labs(title = "team")
garment$team <- as.factor(garment$team)
ggplot(garment, aes(x = team)) +
geom_bar() +
labs(title = "team")

The below figure shows the distribution of the targeted_productivity
variable.
ggplot(data = garment, aes(x = targeted_productivity)) +
geom_boxplot() +
labs(title = "targeted_productivity")

The below figure shows the distribution of the smv variable.
ggplot(data = garment, aes(x = smv)) +
geom_boxplot() +
labs(title = "smv")

The below figure shows the distribution of the wip variable.
ggplot(data = garment, aes(x = wip)) +
geom_boxplot() +
labs(title = "wip")

The below figure shows the distribution of the over_time
variable.
ggplot(data = garment, aes(x = over_time)) +
geom_boxplot() +
labs(title = "over_time")

The below figure shows the distribution of the incentive
variable.
ggplot(data = garment, aes(x = incentive)) +
geom_boxplot() +
labs(title = "incentive")

The below figure shows the distribution of the idle_time
variable.
ggplot(data = garment, aes(x = idle_time)) +
geom_boxplot() +
labs(title = "idle_time")

The below figure shows the distribution of the idle_men variable.
ggplot(data = garment, aes(x = idle_men)) +
geom_boxplot() +
labs(title = "idle_men")

The below figure shows the distribution of the no_of_style_change
variable.
ggplot(data = garment, aes(x = no_of_style_change)) +
geom_boxplot() +
labs(title = "no_of_style_change")

The below figure shows the distribution of the no_of_workers
variable.
ggplot(data = garment, aes(x = no_of_workers)) +
geom_boxplot() +
labs(title = "no_of_workers")

Imputation
We use the MICE imputation method in this section to replace all
missing values in the wip column.
complete_garment_data is the complete dataset.
init <- mice(garment, maxit = 0)
init$method
date quarter department
"" "" ""
day team targeted_productivity
"" "" ""
smv wip over_time
"" "pmm" ""
incentive idle_time idle_men
"" "" ""
no_of_style_change no_of_workers actual_productivity
"" "" ""
imp <- mice(garment, method = c("","", "", "", "", "", "", "pmm", "", "", "", "", "", "", ""),
maxit = 10,
m = 5,
seed=123,
print=F)
complete_garment_data <- complete(imp)
Feature
Engineering
This section focuses on feature engineering.
The ‘team’ variable was already converted to a categorical/factor
variable in the EDA section. Date was dropped from the analysis as it
really serves as more of an identifier. Quarter was also dropped since
it is reliant on the date and the counts for all quarters vary
significantly, as shown in the EDA section.
Below, we have the ‘department’ variable (sewing/finishing). This
variable is converted to binary, with sewing as 0 and finishing as
1.
complete_garment_data1 <- complete_garment_data
complete_garment_data1$department <- ifelse(complete_garment_data1$department=="sweing",0,1)
complete_garment_data1$department <- as.factor(complete_garment_data1$department)
Below, we do one hot encoding for the ‘day’ variable since the values
are nominal.
encoded_data <- model.matrix(~ day - 1, data = complete_garment_data1)
complete_garment_data1 <- cbind(complete_garment_data1, encoded_data)
complete_garment_data1$dayMonday <- as.factor(complete_garment_data1$dayMonday)
complete_garment_data1$daySaturday <- as.factor(complete_garment_data1$daySaturday)
complete_garment_data1$daySunday <- as.factor(complete_garment_data1$daySunday)
complete_garment_data1$dayThursday <- as.factor(complete_garment_data1$dayThursday)
complete_garment_data1$dayTuesday <- as.factor(complete_garment_data1$dayTuesday)
complete_garment_data1$dayWednesday <- as.factor(complete_garment_data1$dayWednesday)
Regularized Linear
Regression
In this section, we do regularized linear regression. We do lasso,
ridge, and elastic net regularization. For each regularization
technique, we conduct coefficient path analysis to see how well each
predictor variable shrinks. Then we use cross-validation to determine
the optimal regularization parameter for each. Lastly, we generate the
final model based on the optimal regularization parameters.
Coefficient Path
Analysis
Below, we split the data into our training and testing. Then we
perform lasso, ridge, and elastic net.
set.seed(112233)
X <- model.matrix(~ department+team+dayMonday+daySaturday+ daySunday+ dayThursday+ dayTuesday+ dayWednesday, data = complete_garment_data1)[, -1]
X <- cbind(complete_garment_data1$targeted_productivity, complete_garment_data1$smv, complete_garment_data1$wip, complete_garment_data1$over_time, complete_garment_data1$incentive, complete_garment_data1$idle_time, complete_garment_data1$idle_men, complete_garment_data1$no_of_style_change, complete_garment_data1$no_of_workers, X)
colnames(X) <- c('targeted_productivity','smv', 'wip',
'over_time',
'incentive', 'idle_time' , 'idle_men',
'no_of_style_change', 'no_of_workers','department',
'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12','dayMonday','daySaturday','daySunday','dayThursday','dayTuesday','dayWednesday')
y <- complete_garment_data1$actual_productivity
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]
preprocess_params <- preProcess(X_train, method = c("center", "scale"))
X_train <- predict(preprocess_params, X_train)
X_test <- predict(preprocess_params, X_test)
fit_lasso <- glmnet(X_train,
y_train,
alpha = 1)
fit_ridge <- glmnet(X_train,
y_train,
alpha = 0)
fit_elastic_net <- glmnet(X_train,
y_train,
alpha = 0.5)
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
cv_elastic_net <- cv.glmnet(X_train, y_train, alpha = 0.5)
Below shows the plot of the coefficient paths for all the predictor
variables for lasso. It shows how variables are shrunk to exactly 0.
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_lasso, xvar = "lambda", label = TRUE,
lwd = 1.5,
main = "Coefficient Path Analysis: LASSO",
cex.main = 0.9,
col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)

Below shows the plot of the coefficient paths for all the predictor
variables for ridge. It shows how variables are shrunk towards 0.
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_ridge, xvar = "lambda", label = TRUE,
lwd = 1.5,
main = "Coefficient Path Analysis: ridge",
cex.main = 0.9,
col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)

Below shows the plot of the coefficient paths for all the predictor
variables for elastic net.
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_elastic_net, xvar = "lambda", label = TRUE,
lwd = 1.5,
main = "Coefficient Path Analysis: elastic net",
cex.main = 0.9,
col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)

Optimal
Parameters
As shown below, using the cross validation we did for glmnet and the
lambda.min (minimum cross-validated error) generated, we get the optimal
parameters for lasso, ridge, and elastic net. LASSO.opt = 0.1328,
Ridge.opt=0.1334, and Elasticnet.opt = 0.1329
best.lasso.lambda <- cv_lasso$lambda.min
best.ridge.lambda <- cv_ridge$lambda.min
best.elastic.net.lambda <- cv_elastic_net$lambda.min
##
# Lasso Regression (L1 Regularization):
# CAUTION: model formula differs from the regular regression formula
lasso_model.opt <- glmnet(X_train,
y_train,
alpha = 1, # lasso regression
lambda = best.lasso.lambda) # useser selected alpha, optimal lambda
# can be obtained through CV (see below)
lasso_predictions.opt <- predict(lasso_model.opt,
s = best.lasso.lambda, # user selected lambda value
# (regularization paremeter)
newx = X_test) # test data set
# The following RMSE of prediction serves as a validation - one step validation
lasso_rmse.opt <- sqrt(mean((y_test - lasso_predictions.opt)^2))
# Ridge Regression (L2 Regularization)
ridge_model.opt <- glmnet(X_train, y_train, alpha = 0, lambda = best.ridge.lambda)
ridge_predictions.opt <- predict(ridge_model.opt, s = best.ridge.lambda, newx = X_test)
ridge_rmse.opt <- sqrt(mean((y_test - ridge_predictions.opt)^2))
# Elastic Net (Combination of L1 and L2)
elastic_net_model.opt <- glmnet(X_train, y_train, alpha = 0.5, lambda = best.elastic.net.lambda)
elastic_net_predictions.opt <- predict(elastic_net_model.opt, s = 0.1, newx = X_test)
elastic_net_rmse.opt <- sqrt(mean((y_test - elastic_net_predictions.opt)^2))
RMSE.opt = cbind(LASSO.opt = lasso_rmse.opt,
Ridge.opt = ridge_rmse.opt,
Elasticnet.opt = elastic_net_rmse.opt)
pander(RMSE.opt)
Final Model
Equations
For lasso, we get the final model equation below based on the optimal
parameter from the previous section.
# lasso
best_lambda.lasso <- cv_lasso$lambda.min
coefficients.lasso <- coef(cv_lasso, s = best_lambda.lasso)
intercept.lasso <- coefficients.lasso[1]
betas.lasso <- coefficients.lasso[-1]
cat("Model equation: y =", round(intercept.lasso,4), "+",
paste(round(betas.lasso,4), colnames(X), sep = "*", collapse = " + "), "\n")
Model equation: y = 0.7338 + 0.0661*targeted_productivity + -0.0788*smv + 0.0059*wip + -0.0128*over_time + 0.0071*incentive + 0.0053*idle_time + -0.0293*idle_men + -0.0168*no_of_style_change + 0.1039*no_of_workers + 0.0224*department + -0.0145*team2 + 9e-04*team3 + -0.0049*team4 + -0.0113*team5 + -0.0195*team6 + -0.0273*team7 + -0.0256*team8 + -0.0253*team9 + -0.0228*team10 + -0.0366*team11 + -0.0122*team12 + -0.0049*dayMonday + 0.0029*daySaturday + -0.0035*daySunday + -0.0059*dayThursday + 0.0039*dayTuesday + 0*dayWednesday
For ridge, we get the final model equation below based on the optimal
parameter from the previous section.
# ridge
best_lambda.ridge <- cv_ridge$lambda.min
coefficients.ridge <- coef(cv_ridge, s = best_lambda.ridge)
intercept.ridge <- coefficients.ridge[1]
betas.ridge <- coefficients.ridge[-1]
cat("Model equation: y =", round(intercept.ridge,4), "+",
paste(round(betas.ridge,4), colnames(X), sep = "*", collapse = " + "), "\n")
Model equation: y = 0.7338 + 0.0637*targeted_productivity + -0.0532*smv + 0.0072*wip + -0.0083*over_time + 0.0077*incentive + 0.0045*idle_time + -0.0277*idle_men + -0.0164*no_of_style_change + 0.0564*no_of_workers + 0.0042*department + -0.0106*team2 + 0.005*team3 + -0.0013*team4 + -0.0085*team5 + -0.0193*team6 + -0.0242*team7 + -0.0224*team8 + -0.0212*team9 + -0.0187*team10 + -0.0293*team11 + -0.0117*team12 + -0.0038*dayMonday + 0.0041*daySaturday + -0.0028*daySunday + -0.0047*dayThursday + 0.0052*dayTuesday + 0.0019*dayWednesday
For elastic net, we get the final model equation below based on the
optimal parameter from the previous section.
# elastic net
best_lambda.net <- cv_elastic_net$lambda.min
coefficients.net <- coef(cv_elastic_net, s = best_lambda.net)
intercept.net <- coefficients.net[1]
betas.net<- coefficients.net[-1]
cat("Model equation: y =", round(intercept.net,4), "+",
paste(round(betas.net,4), colnames(X), sep = "*", collapse = " + "), "\n")
Model equation: y = 0.7338 + 0.066*targeted_productivity + -0.076*smv + 0.006*wip + -0.012*over_time + 0.0071*incentive + 0.0051*idle_time + -0.029*idle_men + -0.0166*no_of_style_change + 0.0965*no_of_workers + 0.0185*department + -0.0135*team2 + 0.0018*team3 + -0.0038*team4 + -0.0104*team5 + -0.0193*team6 + -0.0265*team7 + -0.0248*team8 + -0.0243*team9 + -0.0219*team10 + -0.0353*team11 + -0.0119*team12 + -0.004*dayMonday + 0.0035*daySaturday + -0.0028*daySunday + -0.0051*dayThursday + 0.0045*dayTuesday + 8e-04*dayWednesday
Regularized Logistic
Regression
Coefficient Path
Analysis
In this section, we do regularized logistic regression. The process
is the same as regularized linear.
First, we create a binary response variable called ‘productivity’,
where 1 represents productive and 0 represents not productive. This is
based on if actual productivity level is greater than the targeted
productivity.
Then we split the data into train/test, generate the lasso, ridge,
and elastic net models, and generate our optimal lambda values.
complete_garment_data1 <- transform(complete_garment_data1, productivity=ifelse(actual_productivity>=targeted_productivity, 1, 0))
set.seed(80)
X1 <- model.matrix(~ department+team+dayMonday+daySaturday+ daySunday+ dayThursday+ dayTuesday+ dayWednesday, data = complete_garment_data1)[, -1]
X1 <- cbind(complete_garment_data1$targeted_productivity, complete_garment_data1$smv, complete_garment_data1$wip, complete_garment_data1$over_time, complete_garment_data1$incentive, complete_garment_data1$idle_time, complete_garment_data1$idle_men, complete_garment_data1$no_of_style_change, complete_garment_data1$no_of_workers, X1)
#X <- as.data.frame(X)
colnames(X1) <- c('targeted_productivity','smv', 'wip',
'over_time',
'incentive', 'idle_time' , 'idle_men',
'no_of_style_change', 'no_of_workers','department',
'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12','dayMonday','daySaturday','daySunday','dayThursday','dayTuesday','dayWednesday')
#X1 <- as.matrix(complete_garment_data1[, -c(1,2,4,15,16)])
y1 <- complete_garment_data1$productivity
train_index1 <- createDataPartition(y1, p = 0.8, list = FALSE)
X_train1 <- X1[train_index1, ]
X_test1 <- X1[-train_index1, ]
y_train1 <- y1[train_index1]
y_test1 <- y1[-train_index1]
preprocess_params1 <- preProcess(X_train1, method = c("center", "scale"))
X_train1 <- predict(preprocess_params1, X_train1)
X_test1 <- predict(preprocess_params1, X_test1)
lasso_model1 <- glmnet(X_train1, y_train1, family = "binomial", alpha = 1)
cv_lasso1 <- cv.glmnet(X_train1, y_train1, family = "binomial", alpha = 1)
lambda_lasso1 <- cv_lasso1$lambda.min
lasso_model_opt1 <- glmnet(X_train1, y_train1,
family = "binomial",
alpha = 1,
lambda = lambda_lasso1)
ridge_model1 <- glmnet(X_train1, y_train1, family = "binomial", alpha = 0)
cv_ridge1 <- cv.glmnet(X_train1, y_train1, family = "binomial", alpha = 0)
lambda_ridge1 <- cv_ridge1$lambda.min
ridge_model_opt1 <- glmnet(X_train1, y_train1,
family = "binomial",
alpha = 0,
lambda = lambda_ridge1)
elastic_model1 <- glmnet(X_train1, y_train1, family = "binomial", alpha = 0.5)
cv_elastic1 <- cv.glmnet(X_train1, y_train1, family = "binomial", alpha = 0.5)
lambda_elastic1 <- cv_elastic1$lambda.min
elastic_model_opt1 <- glmnet(X_train1, y_train1,
family = "binomial",
alpha = 0.5,
lambda = lambda_elastic1)
Below shows the plot of the coefficient paths for all the predictor
variables for lasso.
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(lasso_model1, xvar = "lambda", label = TRUE,
lwd = 1.5,
main = "Coefficient Path Analysis: lasso",
cex.main = 0.9,
col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)

Below shows the plot of the coefficient paths for all the predictor
variables for ridge.
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(ridge_model1, xvar = "lambda", label = TRUE,
lwd = 1.5,
main = "Coefficient Path Analysis: ridge",
cex.main = 0.9,
col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)

Below shows the plot of the coefficient paths for all the predictor
variables for elastic net.
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(elastic_model1, xvar = "lambda", label = TRUE,
lwd = 1.5,
main = "Coefficient Path Analysis: elastic net",
cex.main = 0.9,
col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)

Regularization
Parameters
Below, we show the coefficients for each predictor variable based on
the optimal lambda values (for lasso/ridge/elastic).
lasso.coef1 <- as.matrix(coef(lasso_model_opt1))
ridge.coef1 <- as.matrix(coef(ridge_model_opt1))
elastic.coef1 <- as.matrix(coef(elastic_model_opt1))
regularized.coef1 <- data.frame(lasso = lasso.coef1[,1],
ridge = ridge.coef1[,1],
elasticnet = elastic.coef1[,1])
pander(regularized.coef1)
| (Intercept) |
1.263 |
1.254 |
1.298 |
| targeted_productivity |
0 |
0.007142 |
0 |
| smv |
-0.8924 |
-0.628 |
-0.9713 |
| wip |
0.2414 |
0.3189 |
0.3496 |
| over_time |
0 |
0.006669 |
-0.01018 |
| incentive |
0.0507 |
0.07586 |
0.05811 |
| idle_time |
0.01121 |
0.04312 |
0.04022 |
| idle_men |
-0.3018 |
-0.3066 |
-0.3264 |
| no_of_style_change |
-0.1462 |
-0.1505 |
-0.1646 |
| no_of_workers |
0.6497 |
0.5627 |
0.7433 |
| department |
-0.886 |
-0.6879 |
-0.8928 |
| team2 |
-0.0874 |
-0.1596 |
-0.1803 |
| team3 |
0.07338 |
0.05198 |
0.02416 |
| team4 |
0 |
-0.05696 |
-0.06196 |
| team5 |
-0.06979 |
-0.1444 |
-0.1531 |
| team6 |
-0.1828 |
-0.2255 |
-0.2642 |
| team7 |
-0.1707 |
-0.234 |
-0.2493 |
| team8 |
-0.2553 |
-0.3233 |
-0.3415 |
| team9 |
-0.1998 |
-0.2619 |
-0.2942 |
| team10 |
-0.1458 |
-0.2048 |
-0.2324 |
| team11 |
-0.2054 |
-0.2433 |
-0.2937 |
| team12 |
0 |
-0.04955 |
-0.07529 |
| dayMonday |
0 |
0.0559 |
0.01095 |
| daySaturday |
0.0003255 |
0.05972 |
0.01684 |
| daySunday |
-0.1297 |
-0.1019 |
-0.1377 |
| dayThursday |
-0.1334 |
-0.09887 |
-0.1433 |
| dayTuesday |
0 |
0.06016 |
0.01499 |
| dayWednesday |
0 |
0.03093 |
0 |
Optimal Cutoff and
ROC/AUC
The below figure shows the optimal cut-off probability for the
lasso/ridge/elastic net models. This will allow us to use the fitted
model for predicting the response variable (productivity). All the
optimal cut-off probabilities are approx 0.55
predict_lasso <- predict(lasso_model_opt1, newx = X_test1, type = "response")
predict_ridge <- predict(ridge_model_opt1, newx = X_test1, type = "response")
predict_elastic <- predict(elastic_model_opt1, newx = X_test1, type = "response")
###########################################
## Optimal cutoff probability determination
seq.cut <- seq(0,1, length=50)
# y is a vector of 0 and 1
acc.lasso <- NULL
acc.ridge <- NULL
acc.elastic <- NULL
for (i in 1:length(seq.cut)){
predy.lasso <- ifelse(predict_lasso >seq.cut[i], 1, 0)
predy.ridge<- ifelse(predict_ridge >seq.cut[i], 1, 0)
predy.elastic<- ifelse(predict_elastic >seq.cut[i], 1, 0)
##
acc.lasso[i] <- mean(y_test1 == predy.lasso)
acc.ridge[i] <- mean(y_test1 == predy.ridge)
acc.elastic[i] <- mean(y_test1 == predy.elastic)
}
## optimal cut-off: if the maximum accuracy occurs at multiple
## cut-off probabilities, the average of these cutoff probabilities
## will be defined as the optimal cutoff probability
opt.cut.lasso <- mean(seq.cut[which(acc.lasso==max(acc.lasso))])
opt.cut.ridge<- mean(seq.cut[which(acc.ridge==max(acc.ridge))])
opt.cut.elastic <- mean(seq.cut[which(acc.elastic==max(acc.elastic))])
##
acc.data <- data.frame(prob = rep(seq.cut,3),
acc=c(acc.lasso, acc.ridge, acc.elastic),
group = c(rep("lasso",50), rep("ridge",50), rep("elastic",50)))
##
gg.acc <- ggplot(data = acc.data, aes(x=prob, y = acc, color = group)) +
geom_line() +
annotate("text", x = 0.6, y = 0.45,
label = paste("LASSO cutoff: ", round(opt.cut.lasso,5), "Accuracy: ", round(max(acc.lasso),5),
"\nRidge cutoff: ", round(opt.cut.ridge,5), "Accuracy: ", round(max(acc.ridge),5),
"\nElastic cutoff: ", round(opt.cut.elastic,5), "Accuracy: ", round(max(acc.elastic),5)),
size = 3,
color = "navy") +
ggtitle("Cut-off Probability vs Accuracy") +
labs(x = "cut-off Probability",
y = "accuracy", color = "Group") +
theme(plot.title = element_text(hjust = 0.5))
##
ggplotly(gg.acc)
We can compare the performance of the lasso, Ridge, and Elastic Net
models on our test data using ROC curves. All models have similar
performances with AUC of approx 0.75.
prob_lasso <- predict(lasso_model_opt1, newx = X_test1, type = "response")
prob_ridge <- predict(ridge_model_opt1, newx = X_test1, type = "response")
prob_elastic <- predict(elastic_model_opt1, newx = X_test1, type = "response")
# Compute ROC curves: roc object contains a lot information including
# sensitivity, specificity, AUC, etc.
roc_lasso <- roc(y_test1, prob_lasso)
roc_ridge <- roc(y_test1, prob_ridge)
roc_elastic <- roc(y_test1, prob_elastic)
# Compute AUC values
auc_lasso <- auc(roc_lasso)
auc_ridge <- auc(roc_ridge)
auc_elastic <- auc(roc_elastic)
## LASSO
sen.lasso <- roc_lasso$sensitivities
spe.lasso <- roc_lasso$specificities
auc.lasso <- roc_lasso$auc
## Ridge
sen.ridge <- roc_ridge$sensitivities
spe.ridge <- roc_ridge$specificities
auc.ridge <- roc_ridge$auc
## Elastic Net
sen.elastic <- roc_elastic$sensitivities
spe.elastic <- roc_elastic$specificities
auc.elastic <- roc_elastic$auc
## Plotting the ROC curves: three colors - green, orange, and purple
plot(1-spe.lasso, sen.lasso,
type = "l",
col = "green",
xlim=c(0,1),
xlab = "1 - specificity",
ylab = "sensitivity",
main = "ROC Curves for LASSO, Ridge, and Elastic Net")
lines(1-spe.ridge, sen.ridge, col = "orange")
lines(1-spe.elastic, sen.elastic, col = "purple")
#abline(0,1, type = "l", lty = 2, col = "steelblue", lwd = 1)
abline(0,1, , lty = 2, col = "steelblue", lwd = 1)
# Add legend
legend("bottomright", legend = c(paste("LASSO (AUC =", round(auc_lasso, 3), ")"),
paste("Ridge (AUC =", round(auc_ridge, 3), ")"),
paste("Elastic Net (AUC =", round(auc_elastic, 3), ")")),
col = c("green", "orange", "purple"), lty = 1, cex = 0.8, bty = "n")

SVM
This section uses Support Vector Machines. Since our target variable
(productivity) has two classes, SVM will find a hyperplane that
maximizes the margin between the two classes.
The below code chunks will perform SVM by using 5-fold cross
validation to search for the optimal hyperparameters and using those to
train the final model. We will use linear and RBF kernels.
complete_garment_data1$productivity <- as.factor(complete_garment_data1$productivity)
set.seed(123)
index <- sample(1:nrow(complete_garment_data1), 0.8 * nrow(complete_garment_data1))
train.data <- complete_garment_data1[index, ]
test.data <- complete_garment_data1[-index, ]
tune_control <- tune.control(
cross = 5,
nrepeat = 1
)
tune.RBF <- tune(
svm,
productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men +
no_of_style_change + no_of_workers + dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday,
data = train.data,
kernel = "radial",
ranges = list(
cost = 10^(-1:2),
gamma = c(0.1, 0.5, 1, 2)
),
tunecontrol = tune_control
)
best.RBF <- tune.RBF$best.model
best.cost.RBF <- best.RBF$cost
best.gamma.RBF <- best.RBF$gamma
final.RBF <- svm(
productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men +
no_of_style_change + no_of_workers + dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday,
data = train.data,
kernel = "radial",
cost = best.cost.RBF,
gamma = best.gamma.RBF,
probability = TRUE
)
##################
# tune.control <- tune.control(
# cross = 5, # Use 5-fold cross-validation, the default is 10-fold cross-validation
# nrepeat = 1 # Number of repetitions (for repeated cross-validation)
# )
tune.lin <- tune(
svm,
productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men +
no_of_style_change + no_of_workers + dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday,
data = train.data,
kernel = "linear",
ranges = list(
cost = 10^(-1:2)
),
tunecontrol = tune_control
)
best.lin <- tune.lin$best.model
best.cost.lin <- best.lin$cost
final.lin <- svm(
productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men +
no_of_style_change + no_of_workers + dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday,
data = train.data,
kernel = "linear",
cost = best.cost.lin,
probability = TRUE
)
logit.fit <- glm(productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men + no_of_style_change + no_of_workers+ dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday, data = train.data, family = binomial)
AIC.logit <- step(logit.fit, direction = "both", trace = 0)
pred.logit <- predict(AIC.logit, test.data, type = "response")
pred.prob.lin <- predict(final.lin, test.data, probability = TRUE)
pred.prob.RBF <- predict(final.RBF, test.data, probability = TRUE)
#####
prob.linear <- attr(pred.prob.lin, "probabilities")[, 2]
prob.radial <- attr(pred.prob.RBF, "probabilities")[, 2]
#####
roc_lin <- roc(test.data$productivity, prob.linear)
roc_RBF <- roc(test.data$productivity, prob.radial)
roc_logit <- roc(test.data$productivity, pred.logit)
lin.sen <- roc_lin$sensitivities
lin.spe <- roc_lin$specificities
rad.sen <- roc_RBF$sensitivities
rad.spe <- roc_RBF$specificities
logit.sen <- roc_logit$sensitivities
logit.spe <- roc_logit$specificities
auc.lin <- roc_lin$auc
auc.rad <- roc_RBF$auc
auc.logit <- roc_logit$auc
Below we plot the ROC for linear, RBF, and standard logistic
regression. Based on the AUC, the SVM linear, SVM RBF, and standard
logistic are similar with area of approx 0.8.
plot(1-lin.spe, lin.sen,
xlab = "1 - specificity",
ylab = "sensitivity",
col = "darkred",
type = "l",
lty = 1,
lwd = 1,
main = "ROC Curves of SVM")
lines(1-rad.spe, rad.sen,
col = "blue",
lty = 1,
lwd = 1)
lines(1-logit.spe, logit.sen,
col = "orange",
lty = 1,
lwd = 1)
abline(0,1, col = "skyblue3", lty = 2, lwd = 2)
abline(v=c(0.049,0.151), lty = 3, col = "darkgreen")
legend("bottomright", c("Linear Kernel", "Radial Kernel", "Logistic Regression"),
lty = c(1,1,1), lwd = rep(1,3),
col = c("red", "blue", "orange"),
bty="n",cex = 0.8)
text(0.8, 0.46, paste("Linear AUC: ", round(auc.lin,4)), cex = 0.8)
text(0.8, 0.4, paste("Radial AUC: ", round(auc.rad,4)), cex = 0.8)
text(0.8, 0.34, paste("Logistic AUC: ", round(auc.logit,4)), cex = 0.8)

SVR
In this section, we will do SVR. Since we have a continuous target
variable, we cannot use a hyperplane to separate classes. Here we try to
fit the predicted data within a specified margin of error. Below we
train the SVR model. We use RBF/linear kernels and standard linear
regression.
X2 <- model.matrix(~ department+team+ dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday, data = complete_garment_data1)[, -1]
X2 <- cbind(complete_garment_data1$targeted_productivity, complete_garment_data1$smv, complete_garment_data1$wip, complete_garment_data1$over_time, complete_garment_data1$incentive, complete_garment_data1$idle_time, complete_garment_data1$idle_men, complete_garment_data1$no_of_style_change, complete_garment_data1$no_of_workers, X2)
#X <- as.data.frame(X)
colnames(X2) <- c('targeted_productivity','smv', 'wip',
'over_time',
'incentive', 'idle_time' , 'idle_men',
'no_of_style_change', 'no_of_workers','department',
'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12','dayMonday','daySaturday','daySunday','dayThursday','dayTuesday','dayWednesday')
X2 <- as.data.frame(X2)
#
# X2 <- cbind(complete_garment_data1[, 15] ,X2)
# #
# #
# colnames(X2) <- c('actual_productivity', 'targeted_productivity','smv', 'wip',
# 'over_time',
# 'incentive', 'idle_time' , 'idle_men',
# 'no_of_style_change', 'no_of_workers','department',
# 'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12')
#X2 <- complete_garment_data1[, -c(1,2,4,15,16)]
y2 <- complete_garment_data1[, 15]
set.seed(123)
train.index2 <- sample(1:nrow(X2), 0.8 * nrow(X2))
X.train2 <- X2[train.index2, ]
y.train2 <- y2[train.index2]
X.test2 <- X2[-train.index2, ]
y.test2 <- y2[-train.index2]
tune.RBF2 <- tune(svm, train.x = X.train2, train.y = y.train2,
ranges = list(epsilon = seq(0.1, 0.5, 0.1),
cost = c(1, 10, 100),
gamma = c(0.01, 0.1, 1)), # Hyperpar in RBF
tunecontrol = tune_control
)
final.RBF2 <- svm(X.train2, y.train2,
type = "eps-regression", # Use "nu-regression" for nu-SVR
kernel = "radial",
epsilon = tune.RBF2$best.parameters$epsilon,
cost = tune.RBF2$best.parameters$cost,
gamma = tune.RBF2$best.parameters$gamma)
pred.RBF <- predict(final.RBF2, X.test2)
# Evaluate performance
mse.RBF <- mean((y.test2 - pred.RBF)^2) # mean square error
mae.RBF <- mean(abs(y.test2 - pred.RBF))
#############################################################
tune.lin2 <- tune(svm, train.x = X.train2, train.y = y.train2,
ranges = list(epsilon = seq(0.1, 0.5, 0.1),
cost = c(1, 10, 100)),
tunecontrol = tune_control
)
final.lin2 <- svm(X.train2, y.train2,
type = "eps-regression",
kernel = "linear",
epsilon = tune.lin2$best.parameters$epsilon,
cost = tune.lin2$best.parameters$cost)
pred.lin <- predict(final.lin2, X.test2)
# Evaluate performance
mse.lin <- mean((y.test2 - pred.lin)^2) # mean square error
mae.lin <- mean(abs(y.test2 - pred.lin))
X2 <- cbind(complete_garment_data1[, 15] ,X2)
colnames(X2) <- c('actual_productivity', 'targeted_productivity','smv', 'wip',
'over_time',
'incentive', 'idle_time' , 'idle_men',
'no_of_style_change', 'no_of_workers','department',
'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12','dayMonday','daySaturday','daySunday','dayThursday','dayTuesday','dayWednesday')
garment.train <- X2[train.index2, ]
garment.test <- X2[-train.index2, ]
# X3 <- cbind(complete_garment_data1[, 15] ,X3)
#
#
# colnames(X3) <- c('actual_productivity', 'targeted_productivity','smv', 'wip',
# 'over_time',
# 'incentive', 'idle_time' , 'idle_men',
# 'no_of_style_change', 'no_of_workers','department',
# 'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12')
lse.fit <- lm(actual_productivity ~ targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men + no_of_style_change + no_of_workers + department + team2+ team3+ team4+ team5+ team6+ team7+ team8+ team9+ team10+ team11+ team12 + dayMonday + daySaturday + daySunday + dayThursday + dayTuesday + dayWednesday,data=garment.train)
AIC.fit <- stepAIC(lse.fit,direction="both", trace = FALSE)
pred.lse <- predict(AIC.fit, X.test2)
mse.lse <- mean((y.test2 - pred.lse)^2) # mean square error
mae.lse <- mean(abs(y.test2 - pred.lse)) # mean absolute error
###
par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(AIC.fit)

The above shows the residual plots when we did the OLS regression.
The residuals vs fitted plot does not seem to show any curve patterns so
no transformation is necessary.
Below we compare the performances of SVR Linear, SVR RBF, and OLS
regression. The MSE and MAE values are all similar, with RBF kernel
doing the best.
Performance <- data.frame(RBF.SVR=c(mse.RBF, mae.RBF),
Linear.SVR = c(mse.lin, mae.lin),
LSE.Reg =c(mse.lse, mae.lse))
row.names(Performance) <- c("MSE", "MAE")
##
pander(Performance)
| MSE |
0.01842 |
0.02233 |
0.02067 |
| MAE |
0.08615 |
0.09348 |
0.1006 |
Results/Conclusion
For regularized linear regression, the optimal parameters for lasso
is 0.1328, 0.1334 for ridge, and 0.1329 for elastic net. These RMSE
values are based on lambda.min (value of lambda that gives the minimum
cross-validated error). Based on these values, we get this equation for
lasso:
Model equation: y = 0.7338 + 0.0661targeted_productivity +
-0.0788smv + 0.0059wip + -0.0128over_time +
0.0071incentive + 0.0053idle_time + -0.0293idle_men +
-0.0168no_of_style_change + 0.1039no_of_workers +
0.0224department + -0.0145team2 + 9e-04team3 +
-0.0049team4 + -0.0113team5 + -0.0195team6 +
-0.0273team7 + -0.0256team8 + -0.0253team9 +
-0.0228team10 + -0.0366team11 + -0.0122team12 +
-0.0049dayMonday + 0.0029daySaturday + -0.0035daySunday +
-0.0059dayThursday + 0.0039dayTuesday + 0*dayWednesday
This equation for ridge:
Model equation: y = 0.7338 + 0.0637targeted_productivity +
-0.0532smv + 0.0072wip + -0.0083over_time +
0.0077incentive + 0.0045idle_time + -0.0277idle_men +
-0.0164no_of_style_change + 0.0564no_of_workers +
0.0042department + -0.0106team2 + 0.005team3 +
-0.0013team4 + -0.0085team5 + -0.0193team6 +
-0.0242team7 + -0.0224team8 + -0.0212team9 +
-0.0187team10 + -0.0293team11 + -0.0117team12 +
-0.0038dayMonday + 0.0041daySaturday + -0.0028daySunday +
-0.0047dayThursday + 0.0052dayTuesday + 0.0019*dayWednesday
This equation for elastic net:
Model equation: y = 0.7338 + 0.066targeted_productivity +
-0.076smv + 0.006wip + -0.012over_time +
0.0071incentive + 0.0051idle_time + -0.029idle_men +
-0.0166no_of_style_change + 0.0965no_of_workers +
0.0185department + -0.0135team2 + 0.0018team3 +
-0.0038team4 + -0.0104team5 + -0.0193team6 +
-0.0265team7 + -0.0248team8 + -0.0243team9 +
-0.0219team10 + -0.0353team11 + -0.0119team12 +
-0.004dayMonday + 0.0035daySaturday + -0.0028daySunday +
-0.0051dayThursday + 0.0045dayTuesday + 8e-04*dayWednesday
For lasso, ridge, and elastic in regularized logistic regression we
get the coefficients shown in ‘regularized.coef1’. When we compare the
performance of the lasso, Ridge, and Elastic Net models on our test data
using ROC curves, we see that all models have similar performances with
AUC of 0.762 for lasso, 0.757 for ridge, and 0.758 for elastic.
Since the optimal parameter was the lowest in regularized linear and
the AUC was highest in regularized logistic regression, a lasso model is
recommended. Lasso is preferred because it performs feature selection
(shrinking coefficients to 0), easier to interpret, and better for
High-dimensional data.
For SVM, the AUC for SVM (linear kernel) is 0.789, 0.8193 for SVM
(RBF kernel), and 0.79 for standard stepwise logistic regression.
Overall, since the AUC for SVM (both linear/RBF) are greater than the
AUCs in the regularized logistic regression, we can assume that SVM is
better because they are robust to overfitting, can use kernels, and can
handle linear/nonlinear data.
For SVR, SVR RBF has an MSE of 0.01842, 0.02233 for SVR Linear, and
0.02067 for OLS regression. Since the MSE values are lower than that in
regularized linear regression, we can assume that SVR is better.
---
title: "Productivity Prediction of Garment Employees Dataset Final Report"
author: "Eric Zhu"
date: "2025-03-31"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: no
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight: bold;
  color: navy;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-family: system-ui;
  font-weight: bold;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 16px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 14px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>

```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}
if (!require("readxl")) { # SVM methodology
   install.packages("readxl")
library(readxl)
}
if (!require("ggplot2")) { # SVM methodology
   install.packages("ggplot2")
library(ggplot2)
}
if (!require("ISLR")) { # contains example data set "Khan"
   install.packages("ISLR")
library(ISLR)
}
if (!require("MASSExtra")) { # contains example data set "Khan"
   install.packages("MASSExtra")
library(MASSExtra)
}
if (!require("missMethods")) { # customized coloring of plots
   install.packages("missMethods")
library(missMethods)
}
if (!require("Hmisc")) { # customized coloring of plots
   install.packages("Hmisc")
library(Hmisc)
}
if (!require("glmnet")) { # customized coloring of plots
   install.packages("glmnet")
library(glmnet)
}
if (!require("GGally")) { # customized coloring of plots
   install.packages("GGally")
library(GGally)
}
if (!require("mice")) { # customized coloring of plots
   install.packages("mice")
library(mice)
}
if (!require("plotly")) { # customized coloring of plots
   install.packages("plotly")
library(plotly)
}
if (!require("caret")) { # customized coloring of plots
   install.packages("caret")
library(caret)
}
if (!require("pROC")) { # customized coloring of plots
   install.packages("pROC")
library(pROC)
}
if (!require("pander")) { # customized coloring of plots
   install.packages("pander")
library(pander)
}
if (!require("e1071")) { # customized coloring of plots
   install.packages("e1071")
library(e1071)
}
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )
```


# Introduction

This dataset is titled “Productivity Prediction of Garment Employees” and shows various explanatory variables and one target/response variable (actual_productivity). The purpose of collecting this dataset is to see what factors may cause employees productivity levels to be high or low. This dataset was collected from Kaggle. This dataset contains 1197 observations and 15 variables (14 feature variables and 1 target variable).

The date (categorical) shows the date in MM-DD-YYYY format. However, the date is more of an identifier than a predictor. The quarter (categorical) shows the portion of the month (divided into 4 quarters). The department (categorical) shows the department (sewing/finishing). The day (categorical) shows the day of the week. The team (categorical represented as numeric) shows the team number. The targeted_productivity (numerical) shows the targeted productivity levels. The smv (numerical) shows the Standard Minute Value (allocated time for a task). The wip (numerical) shows the work in progress (Includes the number of unfinished items for products). The over_time (numerical) shows the amount of overtime by each team in minutes. The incentive (numerical) shows the amount of financial incentive in BDT. The idle_time (numerical) shows the amount of time when the production was interrupted. The idle_men (numerical) shows the number of workers who were idle due to production interruption. The no_of_style_change (numerical) shows the number of changes in the style of a particular product. The no_of_workers (numerical) shows the number of of workers in each team. The target/response variable (actual_productivity) shows the actual productivity level (between 0-1).

The original dataset has 506 missing values in the wip column. This can be resolved by MICE imputation.

Based on this dataset, we can formulate a research question. The question we can form is what variables are the most important in predicting productivity level of employees. For regression, we can use actual_productivity (continuous) as our response variable. For classification, we can create a binary response variable with 1 (productive) and 0 (not productive). We can use the condition if actual_productivity is greater than the targeted_productivity.

We will use Regularized Linear/logistic Regression and SVM/SVR (Linear/RBF kernels) to try to answer the research question.

# Read in Dataset

Read in dataset.

```{r}
garment <- read.csv("garments_worker_productivity.csv")
```

See dataset structure.

```{r}
str(garment)
```


## Check for Missing Values

506 missing values in the wip column.

```{r}
colSums(is.na(garment))
```


# EDA

This section will show the distribution of each individual feature.

The below figure shows the distribution of the quarter variable.

```{r}
ggplot(garment, aes(x = quarter)) + 
  
  geom_bar() +
  labs(title = "quarter")
```

The below figure shows the distribution of the department variable. Data entry errors in this column were resolved.

```{r}

garment$department[garment$department == 'finishing '] <- 'finishing'

ggplot(garment, aes(x = department)) + 
  
  geom_bar() +
  labs(title = "department")
```

The below figure shows the distribution of the day variable.

```{r}
ggplot(garment, aes(x = day)) + 
  
  geom_bar() +
  labs(title = "day")
```

The below figure shows the distribution of the team variable.

```{r}
# ggplot(data = garment, aes(x = team)) + 
#   geom_boxplot() + 
#   
#   labs(title = "team")
garment$team <- as.factor(garment$team)

ggplot(garment, aes(x = team)) + 
  
  geom_bar() +
  labs(title = "team")
```

The below figure shows the distribution of the targeted_productivity variable.

```{r}
ggplot(data = garment, aes(x = targeted_productivity)) + 
  geom_boxplot() + 
  
  labs(title = "targeted_productivity")
```

The below figure shows the distribution of the smv variable.

```{r}
ggplot(data = garment, aes(x = smv)) + 
  geom_boxplot() + 
  
  labs(title = "smv")
```

The below figure shows the distribution of the wip variable.

```{r}
ggplot(data = garment, aes(x = wip)) + 
  geom_boxplot() + 
  
  labs(title = "wip")
```

The below figure shows the distribution of the over_time variable.

```{r}
ggplot(data = garment, aes(x = over_time)) + 
  geom_boxplot() + 
  
  labs(title = "over_time")
```

The below figure shows the distribution of the incentive variable.

```{r}
ggplot(data = garment, aes(x = incentive)) + 
  geom_boxplot() + 
  
  labs(title = "incentive")
```

The below figure shows the distribution of the idle_time variable.

```{r}
ggplot(data = garment, aes(x = idle_time)) + 
  geom_boxplot() + 
  
  labs(title = "idle_time")
```

The below figure shows the distribution of the idle_men variable.

```{r}
ggplot(data = garment, aes(x = idle_men)) + 
  geom_boxplot() + 
  
  labs(title = "idle_men")
```

The below figure shows the distribution of the no_of_style_change variable.

```{r}
ggplot(data = garment, aes(x = no_of_style_change)) + 
  geom_boxplot() + 
  
  labs(title = "no_of_style_change")
```

The below figure shows the distribution of the no_of_workers variable.

```{r}
ggplot(data = garment, aes(x = no_of_workers)) + 
  geom_boxplot() + 
  
  labs(title = "no_of_workers")
```

# Imputation

We use the MICE imputation method in this section to replace all missing values in the wip column.

complete_garment_data is the complete dataset.

```{r}
init <- mice(garment, maxit = 0)
init$method
```

```{r}
imp <- mice(garment, method = c("","", "", "",  "", "", "", "pmm", "", "", "", "", "", "", ""), 
                 maxit = 10,  
                 m = 5, 
                 seed=123,
                 print=F)     



complete_garment_data <- complete(imp)
```

# Feature Engineering

This section focuses on feature engineering.

The 'team' variable was already converted to a categorical/factor variable in the EDA section. Date was dropped from the analysis as it really serves as more of an identifier. Quarter was also dropped since it is reliant on the date and the counts for all quarters vary significantly, as shown in the EDA section.

Below, we have the 'department' variable (sewing/finishing). This variable is converted to binary, with sewing as 0 and finishing as 1.

```{r}

complete_garment_data1 <- complete_garment_data






complete_garment_data1$department <- ifelse(complete_garment_data1$department=="sweing",0,1)



complete_garment_data1$department <- as.factor(complete_garment_data1$department)



```

Below, we do one hot encoding for the 'day' variable since the values are nominal.

```{r}
encoded_data <- model.matrix(~ day - 1, data = complete_garment_data1)

complete_garment_data1 <- cbind(complete_garment_data1, encoded_data)

complete_garment_data1$dayMonday <- as.factor(complete_garment_data1$dayMonday)
complete_garment_data1$daySaturday <- as.factor(complete_garment_data1$daySaturday)
complete_garment_data1$daySunday <- as.factor(complete_garment_data1$daySunday)
complete_garment_data1$dayThursday <- as.factor(complete_garment_data1$dayThursday)
complete_garment_data1$dayTuesday <- as.factor(complete_garment_data1$dayTuesday)
complete_garment_data1$dayWednesday <- as.factor(complete_garment_data1$dayWednesday)
```

# Regularized Linear Regression

In this section, we do regularized linear regression. We do lasso, ridge, and elastic net regularization. For each regularization technique, we conduct coefficient path analysis to see how well each predictor variable shrinks. Then we use cross-validation to determine the optimal regularization parameter for each. Lastly, we generate the final model based on the optimal regularization parameters.

## Coefficient Path Analysis

Below, we split the data into our training and testing. Then we perform lasso, ridge, and elastic net.

```{r}
set.seed(112233)

X <- model.matrix(~ department+team+dayMonday+daySaturday+ daySunday+ dayThursday+ dayTuesday+ dayWednesday, data = complete_garment_data1)[, -1]
X <- cbind(complete_garment_data1$targeted_productivity, complete_garment_data1$smv, complete_garment_data1$wip, complete_garment_data1$over_time, complete_garment_data1$incentive, complete_garment_data1$idle_time, complete_garment_data1$idle_men, complete_garment_data1$no_of_style_change, complete_garment_data1$no_of_workers, X)


colnames(X) <- c('targeted_productivity','smv', 'wip', 
                       'over_time', 
                       'incentive', 'idle_time' , 'idle_men',
                       'no_of_style_change', 'no_of_workers','department',
                       'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12','dayMonday','daySaturday','daySunday','dayThursday','dayTuesday','dayWednesday')

y <- complete_garment_data1$actual_productivity

train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]

preprocess_params <- preProcess(X_train, method = c("center", "scale"))
X_train <- predict(preprocess_params, X_train)
X_test <- predict(preprocess_params, X_test)

fit_lasso <- glmnet(X_train, 
                      y_train, 
                      alpha = 1)         
fit_ridge <- glmnet(X_train, 
                    y_train, 
                    alpha = 0)          
fit_elastic_net <- glmnet(X_train, 
                          y_train, 
                          alpha = 0.5) 

cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)  
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)  
cv_elastic_net <- cv.glmnet(X_train, y_train, alpha = 0.5) 
```

Below shows the plot of the coefficient paths for all the predictor variables for lasso. It shows how variables are shrunk to exactly 0.

```{r}
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_lasso, xvar = "lambda", label = TRUE,
     lwd = 1.5,
     main = "Coefficient Path Analysis: LASSO",
     cex.main = 0.9,
     col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)
```

Below shows the plot of the coefficient paths for all the predictor variables for ridge. It shows how variables are shrunk towards 0.

```{r}
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_ridge, xvar = "lambda", label = TRUE,
     lwd = 1.5,
     main = "Coefficient Path Analysis: ridge",
     cex.main = 0.9,
     col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)
```

Below shows the plot of the coefficient paths for all the predictor variables for elastic net.

```{r}
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_elastic_net, xvar = "lambda", label = TRUE,
     lwd = 1.5,
     main = "Coefficient Path Analysis: elastic net",
     cex.main = 0.9,
     col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)
```


## Optimal Parameters

As shown below, using the cross validation we did for glmnet and the lambda.min (minimum cross-validated error) generated, we get the optimal parameters for lasso, ridge, and elastic net. LASSO.opt = 0.1328, Ridge.opt=0.1334, and Elasticnet.opt = 0.1329

```{r}
best.lasso.lambda <- cv_lasso$lambda.min
best.ridge.lambda <- cv_ridge$lambda.min
best.elastic.net.lambda <- cv_elastic_net$lambda.min
##
# Lasso Regression (L1 Regularization): 
# CAUTION: model formula differs from the regular regression formula 
lasso_model.opt <- glmnet(X_train, 
                      y_train, 
                      alpha = 1,      # lasso regression 
                      lambda = best.lasso.lambda)   # useser selected alpha, optimal lambda
                                      # can be obtained through CV (see below)
lasso_predictions.opt <- predict(lasso_model.opt, 
                             s = best.lasso.lambda, # user selected lambda value 
                                      # (regularization paremeter)
                             newx = X_test)  # test data set 
# The following RMSE of prediction serves as a validation - one step validation
lasso_rmse.opt <- sqrt(mean((y_test - lasso_predictions.opt)^2))   
                                          
# Ridge Regression (L2 Regularization)
ridge_model.opt <- glmnet(X_train, y_train, alpha = 0, lambda = best.ridge.lambda)
ridge_predictions.opt <- predict(ridge_model.opt, s = best.ridge.lambda, newx = X_test)
ridge_rmse.opt <- sqrt(mean((y_test - ridge_predictions.opt)^2))

# Elastic Net (Combination of L1 and L2)
elastic_net_model.opt <- glmnet(X_train, y_train, alpha = 0.5, lambda = best.elastic.net.lambda)
elastic_net_predictions.opt <- predict(elastic_net_model.opt, s = 0.1, newx = X_test)
elastic_net_rmse.opt <- sqrt(mean((y_test - elastic_net_predictions.opt)^2))

RMSE.opt = cbind(LASSO.opt = lasso_rmse.opt, 
                 Ridge.opt =  ridge_rmse.opt, 
                 Elasticnet.opt = elastic_net_rmse.opt)
pander(RMSE.opt)
```

## Final Model Equations

For lasso, we get the final model equation below based on the optimal parameter from the previous section.

```{r}
# lasso

best_lambda.lasso <- cv_lasso$lambda.min
coefficients.lasso <- coef(cv_lasso, s = best_lambda.lasso)

intercept.lasso <- coefficients.lasso[1]
betas.lasso <- coefficients.lasso[-1]

cat("Model equation: y =", round(intercept.lasso,4), "+", 
paste(round(betas.lasso,4), colnames(X), sep = "*", collapse = " + "), "\n")
```

For ridge, we get the final model equation below based on the optimal parameter from the previous section.

```{r}
# ridge

best_lambda.ridge <- cv_ridge$lambda.min
coefficients.ridge <- coef(cv_ridge, s = best_lambda.ridge)

intercept.ridge <- coefficients.ridge[1]
betas.ridge <- coefficients.ridge[-1]

cat("Model equation: y =", round(intercept.ridge,4), "+", 
paste(round(betas.ridge,4), colnames(X), sep = "*", collapse = " + "), "\n")
```

For elastic net, we get the final model equation below based on the optimal parameter from the previous section.

```{r}
# elastic net

best_lambda.net <- cv_elastic_net$lambda.min
coefficients.net <- coef(cv_elastic_net, s = best_lambda.net)

intercept.net <- coefficients.net[1]
betas.net<- coefficients.net[-1]

cat("Model equation: y =", round(intercept.net,4), "+", 
 paste(round(betas.net,4), colnames(X), sep = "*", collapse = " + "), "\n")
```

# Regularized Logistic Regression

## Coefficient Path Analysis

In this section, we do regularized logistic regression. The process is the same as regularized linear.

First, we create a binary response variable called 'productivity', where 1 represents productive and 0 represents not productive. This is based on if actual productivity level is greater than the targeted productivity.

Then we split the data into train/test, generate the lasso, ridge, and elastic net models, and generate our optimal lambda values.

```{r}



complete_garment_data1 <- transform(complete_garment_data1, productivity=ifelse(actual_productivity>=targeted_productivity, 1, 0))







set.seed(80)

X1 <- model.matrix(~ department+team+dayMonday+daySaturday+ daySunday+ dayThursday+ dayTuesday+ dayWednesday, data = complete_garment_data1)[, -1]
X1 <- cbind(complete_garment_data1$targeted_productivity, complete_garment_data1$smv, complete_garment_data1$wip, complete_garment_data1$over_time, complete_garment_data1$incentive, complete_garment_data1$idle_time, complete_garment_data1$idle_men, complete_garment_data1$no_of_style_change, complete_garment_data1$no_of_workers, X1)
#X <- as.data.frame(X)

colnames(X1) <- c('targeted_productivity','smv', 'wip', 
                       'over_time', 
                       'incentive', 'idle_time' , 'idle_men',
                       'no_of_style_change', 'no_of_workers','department',
                       'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12','dayMonday','daySaturday','daySunday','dayThursday','dayTuesday','dayWednesday')

#X1 <- as.matrix(complete_garment_data1[, -c(1,2,4,15,16)])
y1 <- complete_garment_data1$productivity

train_index1 <- createDataPartition(y1, p = 0.8, list = FALSE)
X_train1 <- X1[train_index1, ]
X_test1 <- X1[-train_index1, ]
y_train1 <- y1[train_index1]
y_test1 <- y1[-train_index1]

 preprocess_params1 <- preProcess(X_train1, method = c("center", "scale"))
 X_train1 <- predict(preprocess_params1, X_train1)
 X_test1 <- predict(preprocess_params1, X_test1)
 
 
lasso_model1 <- glmnet(X_train1, y_train1, family = "binomial", alpha = 1)


cv_lasso1 <- cv.glmnet(X_train1, y_train1, family = "binomial", alpha = 1)

lambda_lasso1 <- cv_lasso1$lambda.min


lasso_model_opt1 <- glmnet(X_train1, y_train1, 
                          family = "binomial", 
                          alpha = 1, 
                          lambda = lambda_lasso1)
 
 

```

```{r}
ridge_model1 <- glmnet(X_train1, y_train1, family = "binomial", alpha = 0)


cv_ridge1 <- cv.glmnet(X_train1, y_train1, family = "binomial", alpha = 0)


lambda_ridge1 <- cv_ridge1$lambda.min


ridge_model_opt1 <- glmnet(X_train1, y_train1, 
                          family = "binomial", 
                          alpha = 0, 
                          lambda = lambda_ridge1)



elastic_model1 <- glmnet(X_train1, y_train1, family = "binomial", alpha = 0.5)


cv_elastic1 <- cv.glmnet(X_train1, y_train1, family = "binomial", alpha = 0.5)


lambda_elastic1 <- cv_elastic1$lambda.min


elastic_model_opt1 <- glmnet(X_train1, y_train1, 
                            family = "binomial", 
                            alpha = 0.5, 
                            lambda = lambda_elastic1)
```

Below shows the plot of the coefficient paths for all the predictor variables for lasso.

```{r}
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(lasso_model1, xvar = "lambda", label = TRUE,
     lwd = 1.5,
     main = "Coefficient Path Analysis: lasso",
     cex.main = 0.9,
     col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)
```

Below shows the plot of the coefficient paths for all the predictor variables for ridge.

```{r}
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(ridge_model1, xvar = "lambda", label = TRUE,
     lwd = 1.5,
     main = "Coefficient Path Analysis: ridge",
     cex.main = 0.9,
     col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)
```

Below shows the plot of the coefficient paths for all the predictor variables for elastic net.

```{r}
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(elastic_model1, xvar = "lambda", label = TRUE,
     lwd = 1.5,
     main = "Coefficient Path Analysis: elastic net",
     cex.main = 0.9,
     col = rainbow(10))
abline(v = 1, col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)
```

## Regularization Parameters

Below, we show the coefficients for each predictor variable based on the optimal lambda values (for lasso/ridge/elastic).

```{r}
lasso.coef1 <- as.matrix(coef(lasso_model_opt1))
ridge.coef1 <- as.matrix(coef(ridge_model_opt1))
elastic.coef1 <- as.matrix(coef(elastic_model_opt1))
regularized.coef1 <- data.frame(lasso = lasso.coef1[,1],
                               ridge = ridge.coef1[,1],
                          elasticnet = elastic.coef1[,1])
pander(regularized.coef1)
```

## Optimal Cutoff and ROC/AUC

The below figure shows the optimal cut-off probability for the lasso/ridge/elastic net models. This will allow us to use the fitted model for predicting the response variable (productivity). All the optimal cut-off probabilities are approx 0.55 

```{r}
predict_lasso <- predict(lasso_model_opt1, newx = X_test1, type = "response")
predict_ridge <- predict(ridge_model_opt1, newx = X_test1, type = "response")
predict_elastic <- predict(elastic_model_opt1, newx = X_test1, type = "response")

###########################################
## Optimal cutoff probability determination
seq.cut <- seq(0,1, length=50)
# y is a vector of 0 and 1
acc.lasso <- NULL
acc.ridge <- NULL
acc.elastic <- NULL
for (i in 1:length(seq.cut)){
   predy.lasso <- ifelse(predict_lasso >seq.cut[i], 1, 0)
   predy.ridge<- ifelse(predict_ridge >seq.cut[i], 1, 0)
   predy.elastic<- ifelse(predict_elastic >seq.cut[i], 1, 0)
   ##
   acc.lasso[i] <- mean(y_test1  == predy.lasso)
   acc.ridge[i] <- mean(y_test1 == predy.ridge)
   acc.elastic[i] <- mean(y_test1  == predy.elastic)
}
## optimal cut-off: if the maximum accuracy occurs at multiple
## cut-off probabilities, the average of these cutoff probabilities
## will be defined as the optimal cutoff probability
opt.cut.lasso <- mean(seq.cut[which(acc.lasso==max(acc.lasso))])
opt.cut.ridge<- mean(seq.cut[which(acc.ridge==max(acc.ridge))])
opt.cut.elastic <- mean(seq.cut[which(acc.elastic==max(acc.elastic))])
##
acc.data <- data.frame(prob = rep(seq.cut,3), 
                       acc=c(acc.lasso, acc.ridge, acc.elastic), 
                       group = c(rep("lasso",50), rep("ridge",50), rep("elastic",50)))
```

```{r}
##
gg.acc <- ggplot(data = acc.data, aes(x=prob, y = acc, color = group)) +
  geom_line() +
  annotate("text", x = 0.6, y = 0.45, 
           label = paste("LASSO cutoff: ", round(opt.cut.lasso,5), "Accuracy: ", round(max(acc.lasso),5), 
                         "\nRidge cutoff: ", round(opt.cut.ridge,5), "Accuracy: ", round(max(acc.ridge),5), 
                         "\nElastic cutoff: ", round(opt.cut.elastic,5), "Accuracy: ", round(max(acc.elastic),5)), 
           size = 3, 
           color = "navy") +
  ggtitle("Cut-off Probability vs Accuracy") +
  labs(x = "cut-off Probability", 
       y = "accuracy", color = "Group") +
  theme(plot.title = element_text(hjust = 0.5))

##
ggplotly(gg.acc)
```

We can compare the performance of the lasso, Ridge, and Elastic Net models on our test data using ROC curves. All models have similar performances with AUC of approx 0.75.

```{r}
prob_lasso <- predict(lasso_model_opt1, newx = X_test1, type = "response")
prob_ridge <- predict(ridge_model_opt1, newx = X_test1, type = "response")
prob_elastic <- predict(elastic_model_opt1, newx = X_test1, type = "response")

# Compute ROC curves: roc object contains a lot information including
# sensitivity, specificity, AUC, etc.
roc_lasso <- roc(y_test1, prob_lasso)
roc_ridge <- roc(y_test1, prob_ridge)
roc_elastic <- roc(y_test1, prob_elastic)

# Compute AUC values
auc_lasso <- auc(roc_lasso)
auc_ridge <- auc(roc_ridge)
auc_elastic <- auc(roc_elastic)

## LASSO
sen.lasso <- roc_lasso$sensitivities
spe.lasso <- roc_lasso$specificities
auc.lasso <- roc_lasso$auc

## Ridge
sen.ridge <- roc_ridge$sensitivities
spe.ridge <- roc_ridge$specificities
auc.ridge <- roc_ridge$auc

## Elastic Net
sen.elastic <- roc_elastic$sensitivities
spe.elastic <- roc_elastic$specificities
auc.elastic <- roc_elastic$auc

## Plotting the ROC curves: three colors - green, orange, and purple

plot(1-spe.lasso, sen.lasso, 
     type = "l",
     col = "green", 
     xlim=c(0,1),
     xlab = "1 - specificity",
     ylab = "sensitivity",
     main = "ROC Curves for LASSO, Ridge, and Elastic Net")
lines(1-spe.ridge, sen.ridge, col = "orange")
lines(1-spe.elastic, sen.elastic, col = "purple")
#abline(0,1, type = "l", lty = 2, col = "steelblue", lwd = 1)
abline(0,1, , lty = 2, col = "steelblue", lwd = 1)
# Add legend
legend("bottomright", legend = c(paste("LASSO (AUC =", round(auc_lasso, 3), ")"),
                                paste("Ridge (AUC =", round(auc_ridge, 3), ")"),
                                paste("Elastic Net (AUC =", round(auc_elastic, 3), ")")),
       col = c("green", "orange", "purple"), lty = 1, cex = 0.8, bty = "n")
```

# SVM

This section uses Support Vector Machines. Since our target variable (productivity) has two classes, SVM will find a hyperplane that maximizes the margin between the two classes.

The below code chunks will perform SVM by using 5-fold cross validation to search for the optimal hyperparameters and using those to train the final model. We will use linear and RBF kernels.

```{r}
complete_garment_data1$productivity <- as.factor(complete_garment_data1$productivity)

set.seed(123)    
index <- sample(1:nrow(complete_garment_data1), 0.8 * nrow(complete_garment_data1))
train.data <- complete_garment_data1[index, ]
test.data <- complete_garment_data1[-index, ]

tune_control <- tune.control(
  cross = 5,  
  nrepeat = 1 
)

tune.RBF <- tune(
  svm,          
  productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men +
    no_of_style_change + no_of_workers + dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday, 
  data = train.data,
  kernel = "radial",    
  ranges = list(
    cost = 10^(-1:2),   
    gamma = c(0.1, 0.5, 1, 2) 
  ),
  tunecontrol = tune_control  
)


best.RBF <- tune.RBF$best.model
best.cost.RBF <- best.RBF$cost
best.gamma.RBF <- best.RBF$gamma


final.RBF <- svm(
  productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men +
    no_of_style_change + no_of_workers + dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday,
  data = train.data,
  kernel = "radial",
  cost = best.cost.RBF,
  gamma = best.gamma.RBF,
  probability = TRUE
)


##################

# tune.control <- tune.control(
#   cross = 5,  # Use 5-fold cross-validation, the default is 10-fold cross-validation
#   nrepeat = 1 # Number of repetitions (for repeated cross-validation)
# )

tune.lin <- tune(
  svm,          
  productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men +
    no_of_style_change + no_of_workers + dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday, 
  data = train.data,
  kernel = "linear",    
  ranges = list(
    cost = 10^(-1:2)   
  ),
  tunecontrol = tune_control  
)

best.lin <- tune.lin$best.model
best.cost.lin <- best.lin$cost

final.lin <- svm(
  productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men +
    no_of_style_change + no_of_workers + dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday,
  data = train.data,
  kernel = "linear",
  cost = best.cost.lin,
  probability = TRUE
)
```

```{r}
logit.fit <- glm(productivity ~ department + team + targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men + no_of_style_change + no_of_workers+ dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday, data = train.data, family = binomial)
AIC.logit <- step(logit.fit, direction = "both", trace = 0)
pred.logit <- predict(AIC.logit, test.data, type = "response")


pred.prob.lin <- predict(final.lin, test.data, probability = TRUE)
pred.prob.RBF <- predict(final.RBF, test.data, probability = TRUE)

#####
prob.linear <- attr(pred.prob.lin, "probabilities")[, 2]
prob.radial <- attr(pred.prob.RBF, "probabilities")[, 2]
#####

roc_lin <- roc(test.data$productivity, prob.linear)
roc_RBF <- roc(test.data$productivity, prob.radial)
roc_logit <- roc(test.data$productivity, pred.logit)

lin.sen <- roc_lin$sensitivities
lin.spe <- roc_lin$specificities
rad.sen <- roc_RBF$sensitivities
rad.spe <- roc_RBF$specificities
logit.sen <- roc_logit$sensitivities
logit.spe <- roc_logit$specificities
auc.lin <- roc_lin$auc
auc.rad <- roc_RBF$auc
auc.logit <- roc_logit$auc
```

Below we plot the ROC for linear, RBF, and standard logistic regression. Based on the AUC, the SVM linear, SVM RBF, and standard logistic are similar with area of approx 0.8.

```{r}
plot(1-lin.spe, lin.sen,  
     xlab = "1 - specificity",
     ylab = "sensitivity",
     col = "darkred",
     type = "l",
     lty = 1,
     lwd = 1,
     main = "ROC Curves of SVM")
lines(1-rad.spe, rad.sen, 
      col = "blue",
      lty = 1,
      lwd = 1)
lines(1-logit.spe, logit.sen,      
      col = "orange",
      lty = 1,
      lwd = 1)
abline(0,1, col = "skyblue3", lty = 2, lwd = 2)
abline(v=c(0.049,0.151), lty = 3, col = "darkgreen")
legend("bottomright", c("Linear Kernel", "Radial Kernel", "Logistic Regression"),
       lty = c(1,1,1), lwd = rep(1,3),
       col = c("red", "blue", "orange"),
       bty="n",cex = 0.8)

text(0.8, 0.46, paste("Linear AUC: ", round(auc.lin,4)), cex = 0.8)
text(0.8, 0.4, paste("Radial AUC: ", round(auc.rad,4)), cex = 0.8)
text(0.8, 0.34, paste("Logistic AUC: ", round(auc.logit,4)), cex = 0.8)
```


# SVR

In this section, we will do SVR. Since we have a continuous target variable, we cannot use a hyperplane to separate classes. Here we try to fit the predicted data within a specified margin of error. Below we train the SVR model. We use RBF/linear kernels and standard linear regression.

```{r}

X2 <- model.matrix(~ department+team+ dayMonday + daySaturday+ daySunday + dayThursday + dayTuesday + dayWednesday, data = complete_garment_data1)[, -1]
X2 <- cbind(complete_garment_data1$targeted_productivity, complete_garment_data1$smv, complete_garment_data1$wip, complete_garment_data1$over_time, complete_garment_data1$incentive, complete_garment_data1$idle_time, complete_garment_data1$idle_men, complete_garment_data1$no_of_style_change, complete_garment_data1$no_of_workers, X2)
#X <- as.data.frame(X)

colnames(X2) <- c('targeted_productivity','smv', 'wip', 
                       'over_time', 
                       'incentive', 'idle_time' , 'idle_men',
                       'no_of_style_change', 'no_of_workers','department',
                       'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12','dayMonday','daySaturday','daySunday','dayThursday','dayTuesday','dayWednesday')

 X2 <- as.data.frame(X2)
# 
#  X2 <- cbind(complete_garment_data1[, 15] ,X2)
# # 
# # 
#  colnames(X2) <- c('actual_productivity', 'targeted_productivity','smv', 'wip', 
#                         'over_time', 
#                         'incentive', 'idle_time' , 'idle_men',
#                         'no_of_style_change', 'no_of_workers','department',
#                        'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12')

#X2 <- complete_garment_data1[, -c(1,2,4,15,16)] 
y2 <- complete_garment_data1[, 15] 

set.seed(123)
train.index2 <- sample(1:nrow(X2), 0.8 * nrow(X2))
X.train2 <- X2[train.index2, ]
y.train2 <- y2[train.index2]
X.test2 <- X2[-train.index2, ]
y.test2 <- y2[-train.index2]

tune.RBF2 <- tune(svm, train.x = X.train2, train.y = y.train2, 
                    ranges = list(epsilon = seq(0.1, 0.5, 0.1), 
                                  cost = c(1, 10, 100), 
                                  gamma = c(0.01, 0.1, 1)), # Hyperpar in RBF
                    tunecontrol = tune_control
                    )

final.RBF2 <- svm(X.train2, y.train2, 
                   type = "eps-regression",  # Use "nu-regression" for nu-SVR
                   kernel = "radial", 
                   epsilon = tune.RBF2$best.parameters$epsilon, 
                   cost = tune.RBF2$best.parameters$cost, 
                   gamma = tune.RBF2$best.parameters$gamma)

pred.RBF <- predict(final.RBF2, X.test2)

# Evaluate performance
mse.RBF <- mean((y.test2 - pred.RBF)^2)    # mean square error
mae.RBF <- mean(abs(y.test2 - pred.RBF))



#############################################################

tune.lin2 <- tune(svm, train.x = X.train2, train.y = y.train2, 
                    ranges = list(epsilon = seq(0.1, 0.5, 0.1), 
                                  cost = c(1, 10, 100)), 
                    tunecontrol = tune_control
                                               
                    )


final.lin2 <- svm(X.train2, y.train2, 
                   type = "eps-regression",  
                   kernel = "linear", 
                   epsilon = tune.lin2$best.parameters$epsilon, 
                   cost = tune.lin2$best.parameters$cost)

pred.lin <- predict(final.lin2, X.test2)

# Evaluate performance
mse.lin <- mean((y.test2 - pred.lin)^2)    # mean square error
mae.lin <- mean(abs(y.test2 - pred.lin))
```

```{r}

X2 <- cbind(complete_garment_data1[, 15] ,X2)


colnames(X2) <- c('actual_productivity', 'targeted_productivity','smv', 'wip',
                       'over_time',
                       'incentive', 'idle_time' , 'idle_men',
                       'no_of_style_change', 'no_of_workers','department',
                       'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12','dayMonday','daySaturday','daySunday','dayThursday','dayTuesday','dayWednesday')

garment.train <- X2[train.index2, ]
garment.test <- X2[-train.index2, ]

# X3 <- cbind(complete_garment_data1[, 15] ,X3)
# 
# 
# colnames(X3) <- c('actual_productivity', 'targeted_productivity','smv', 'wip', 
#                        'over_time', 
#                        'incentive', 'idle_time' , 'idle_men',
#                        'no_of_style_change', 'no_of_workers','department',
#                        'team2','team3','team4','team5','team6', 'team7','team8','team9','team10','team11','team12')

lse.fit <- lm(actual_productivity ~ targeted_productivity + smv + wip + over_time + incentive + idle_time + idle_men + no_of_style_change + no_of_workers + department + team2+ team3+ team4+ team5+ team6+ team7+ team8+ team9+ team10+ team11+ team12 + dayMonday + daySaturday + daySunday + dayThursday + dayTuesday + dayWednesday,data=garment.train)
AIC.fit <- stepAIC(lse.fit,direction="both", trace = FALSE)



pred.lse <- predict(AIC.fit, X.test2)
mse.lse <- mean((y.test2 - pred.lse)^2)    # mean square error
mae.lse <- mean(abs(y.test2 - pred.lse))   # mean absolute error
###
par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(AIC.fit)


```

The above shows the residual plots when we did the OLS regression. The residuals vs fitted plot does not seem to show any curve patterns so no transformation is necessary.

Below we compare the performances of SVR Linear, SVR RBF, and OLS regression. The MSE and MAE values are all similar, with RBF kernel doing the best.

```{r}
Performance <- data.frame(RBF.SVR=c(mse.RBF, mae.RBF),
                          Linear.SVR = c(mse.lin, mae.lin),
                          LSE.Reg =c(mse.lse, mae.lse))
row.names(Performance) <- c("MSE", "MAE")
##
pander(Performance)
```

# Results/Conclusion

For regularized linear regression, the optimal parameters for lasso is 0.1328, 0.1334 for ridge, and 0.1329 for elastic net. These RMSE values are based on lambda.min (value of lambda that gives the minimum cross-validated error). Based on these values, we get this equation for lasso:

Model equation: y = 0.7338 + 0.0661*targeted_productivity + -0.0788*smv + 0.0059*wip + -0.0128*over_time + 0.0071*incentive + 0.0053*idle_time + -0.0293*idle_men + -0.0168*no_of_style_change + 0.1039*no_of_workers + 0.0224*department + -0.0145*team2 + 9e-04*team3 + -0.0049*team4 + -0.0113*team5 + -0.0195*team6 + -0.0273*team7 + -0.0256*team8 + -0.0253*team9 + -0.0228*team10 + -0.0366*team11 + -0.0122*team12 + -0.0049*dayMonday + 0.0029*daySaturday + -0.0035*daySunday + -0.0059*dayThursday + 0.0039*dayTuesday + 0*dayWednesday 


This equation for ridge:

Model equation: y = 0.7338 + 0.0637*targeted_productivity + -0.0532*smv + 0.0072*wip + -0.0083*over_time + 0.0077*incentive + 0.0045*idle_time + -0.0277*idle_men + -0.0164*no_of_style_change + 0.0564*no_of_workers + 0.0042*department + -0.0106*team2 + 0.005*team3 + -0.0013*team4 + -0.0085*team5 + -0.0193*team6 + -0.0242*team7 + -0.0224*team8 + -0.0212*team9 + -0.0187*team10 + -0.0293*team11 + -0.0117*team12 + -0.0038*dayMonday + 0.0041*daySaturday + -0.0028*daySunday + -0.0047*dayThursday + 0.0052*dayTuesday + 0.0019*dayWednesday 


This equation for elastic net:

Model equation: y = 0.7338 + 0.066*targeted_productivity + -0.076*smv + 0.006*wip + -0.012*over_time + 0.0071*incentive + 0.0051*idle_time + -0.029*idle_men + -0.0166*no_of_style_change + 0.0965*no_of_workers + 0.0185*department + -0.0135*team2 + 0.0018*team3 + -0.0038*team4 + -0.0104*team5 + -0.0193*team6 + -0.0265*team7 + -0.0248*team8 + -0.0243*team9 + -0.0219*team10 + -0.0353*team11 + -0.0119*team12 + -0.004*dayMonday + 0.0035*daySaturday + -0.0028*daySunday + -0.0051*dayThursday + 0.0045*dayTuesday + 8e-04*dayWednesday 

For lasso, ridge, and elastic in regularized logistic regression we get the coefficients shown in 'regularized.coef1'. When we compare the performance of the lasso, Ridge, and Elastic Net models on our test data using ROC curves, we see that all models have similar performances with AUC of  0.762 for lasso, 0.757 for ridge, and 0.758 for elastic.

Since the optimal parameter was the lowest in regularized linear and the AUC was highest in regularized logistic regression, a lasso model is recommended. Lasso is preferred because it performs feature selection (shrinking coefficients to 0), easier to interpret, and better for High-dimensional data.

For SVM, the AUC for SVM (linear kernel) is 0.789, 0.8193 for SVM (RBF kernel), and 0.79 for standard stepwise logistic regression. Overall, since the AUC for SVM (both linear/RBF) are greater than the AUCs in the regularized logistic regression, we can assume that SVM is better because they are robust to overfitting, can use kernels, and can handle linear/nonlinear data.

For SVR, SVR RBF has an MSE of 0.01842, 0.02233 for SVR Linear, and 0.02067 for OLS regression. Since the MSE values are lower than that in regularized linear regression, we can assume that SVR is better.




