Generalized Linear Models (GLMs) for Garment Worker Productivity, Part 1

library(tidyverse)

library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)

options(scipen = 6)
theme_set(theme_minimal())

Select an interesting binary column of data, or one which can be reasonably converted into a binary variable.

This week's data dive examines the causes of idle time in garment manufacturing. The binary response is whether a record shows any idle time (idle_time > 0), which I model with logistic regression. I chose this variable because idle time directly shows where the production process is stalling. Analyzing how factors such as department, day of the week, number of workers, and overtime relate to idle time helps explain what influences productivity. The goal is to identify strategies that reduce idle time and improve overall worker productivity.

data <- read.csv("C:/Users/rbada/Downloads/productivity+prediction+of+garment+employees/garments_worker_productivity.csv")

Build a logistic regression model for this variable, using between 1-4 explanatory variables

For this logistic regression model, I chose four variables that could help explain why idle time occurs in garment manufacturing. Department matters because each department works differently, and some may have more delays than others. Number of workers matters because having more people may help reduce idle time, while fewer workers might slow things down. Day of the week shows whether idle time is more common on certain days, such as the start or end of the week. Finally, overtime (over_time) can show whether extra working time helps get more done or just leaves workers tired and less efficient. Together, these four variables can point to ways of improving how work is organized and reducing time wasted during production.

# Create binary column
data$idle_time_binary <- ifelse(data$idle_time > 0, 1, 0)

data <- data[complete.cases(data[, c("department", "day", "no_of_workers", "over_time")]), ]
data$department <- tolower(trimws(data$department))
data$day <- tolower(trimws(data$day))

data$department <- as.factor(data$department)
data$day <- as.factor(data$day)

table(data$department)

## 
## finishing    sweing 
##       506       691

table(data$day)

## 
##    monday  saturday    sunday  thursday   tuesday wednesday 
##       199       187       203       199       201       208

model <- glm(idle_time_binary ~ department + day + no_of_workers + over_time,
             data = data, family = "binomial")

summary(model)

## 
## Call:
## glm(formula = idle_time_binary ~ department + day + no_of_workers + 
##     over_time, family = "binomial", data = data)
## 
## Coefficients:
##                       Estimate    Std. Error z value  Pr(>|z|)    
## (Intercept)       -23.14369156 1268.20829598  -0.018    0.9854    
## departmentsweing   15.35243035 1268.20893519   0.012    0.9903    
## daysaturday         1.23314138    1.25678591   0.981    0.3265    
## daysunday           2.03245512    1.12922804   1.800    0.0719 .  
## daythursday         1.26874844    1.18076338   1.075    0.2826    
## daytuesday          1.11258027    1.17953506   0.943    0.3456    
## daywednesday        1.24291335    1.14717016   1.083    0.2786    
## no_of_workers       0.08769982    0.03640186   2.409    0.0160 *  
## over_time          -0.00034180    0.00008414  -4.062 0.0000486 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 186.83  on 1196  degrees of freedom
## Residual deviance: 142.63  on 1188  degrees of freedom
## AIC: 160.63
## 
## Number of Fisher Scoring iterations: 20

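Four explanatory variables were selected: department, day of the week, number of workers, and overtime, all of which could reasonably influence idle time in a garment factory. In the fitted model, only no_of_workers and over_time have small Wald p-values. The department result should not be read as "no effect", though: an estimate of 15.35 with a standard error of about 1268, together with 20 Fisher scoring iterations, is the typical signature of quasi-complete separation, which would arise if one department (here, finishing) recorded essentially no idle time. In that situation the Wald test is uninformative rather than evidence of a weak effect. None of the day-of-week coefficients is significant at the 5% level (Sunday is closest, p = 0.072). Because department cannot be estimated reliably and day adds little, the analysis continues with the two numeric predictors, which gives a simpler model. The results so far suggest that team size and overtime are both associated with idle time, and that the department split deserves a direct check.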
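The department coefficient in the summary above has a standard error of about 1268, which is often a symptom of quasi-complete separation. The sketch below reproduces that pattern on toy data (`toy`, `group`, and `event` are hypothetical names, not columns of the garment dataset): one group never has the event, so the cross-tabulation has an empty cell and the fitted standard error becomes very large. In this notebook, the analogous check would be a cross-tabulation of department against idle_time_binary.

```r
# Toy data (hypothetical): group "a" never has the event, group "b" sometimes does.
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 50)),
  event = c(rep(0, 50), rep(c(0, 1), times = c(40, 10)))
)

# An empty cell in the cross-tabulation signals (quasi-)complete separation.
xt <- table(toy$group, toy$event)
print(xt)

# glm() still returns a fit (it may warn), but the standard error is huge.
fit <- suppressWarnings(glm(event ~ group, data = toy, family = binomial))
se_group <- coef(summary(fit))["groupb", "Std. Error"]
print(se_group)
```

If the real cross-tabulation shows a zero cell, the large p-value for department reflects an estimation problem, not a weak relationship.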

library(ggplot2)

# Example for no_of_workers
ggplot(data, aes(x = no_of_workers, y = idle_time_binary)) +
  geom_jitter(height = 0.1, width = 0, alpha = 0.5) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(title = "Idle Time vs. Number of Workers",
       x = "Number of Workers", y = "Idle Time (0 or 1)")

## `geom_smooth()` using formula = 'y ~ x'

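The number of workers was plotted against the idle-time indicator with a fitted logistic regression curve. The curve rises with team size, so idle time becomes more likely as the number of workers increases. Note that a fitted logistic curve is S-shaped on the probability scale by construction, so its curvature alone says nothing about whether the relationship is linear in the log-odds. The upward trend is consistent with larger teams being harder to coordinate, although team size may also differ by department, so this is an association and not a demonstrated cause. It still points to the importance of proper team sizing and effective task distribution.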

data$sqrt_workers <- sqrt(data$no_of_workers)

p1 <- ggplot(data, aes(x = no_of_workers, y = idle_time_binary)) +
  geom_jitter(height = 0.1, alpha = 0.4) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(title = "Original: No. of Workers vs. Idle Time")

p2 <- ggplot(data, aes(x = sqrt_workers, y = idle_time_binary)) +
  geom_jitter(height = 0.1, alpha = 0.4) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(title = "Transformed: sqrt(No. of Workers) vs. Idle Time")

library(patchwork)
p1 + p2

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The original no_of_workers variable was compared with its square root using side-by-side logistic regression plots. On the original scale the fitted probability rises sharply at high worker counts; on the square-root scale the rise is more gradual. Because both curves are logistic fits, their shapes cannot by themselves show which scale is closer to linear in the log-odds. The square root is used in what follows as a candidate transformation that compresses large team sizes; whether it actually improves the fit is better judged by comparing the two models directly, for example by AIC or residual deviance.
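One way to compare the two scales more directly is to fit both models and compare their AIC. The sketch below does this on simulated data (the data frame `toy` and its columns are invented for illustration, and none of the numbers relate to the garment dataset). In this notebook, the same comparison would set a model using no_of_workers against one using sqrt_workers.

```r
set.seed(1)

# Simulated data (hypothetical): the true log-odds are linear in sqrt(workers).
n <- 500
toy <- data.frame(workers = sample(2:60, n, replace = TRUE))
toy$y <- rbinom(n, 1, plogis(-6 + 0.8 * sqrt(toy$workers)))

# Fit the same response on the raw and square-root scales.
fit_raw  <- glm(y ~ workers,       data = toy, family = binomial)
fit_sqrt <- glm(y ~ sqrt(workers), data = toy, family = binomial)

# Both models have two parameters, so the lower AIC marks the better-fitting scale.
aic_tab <- AIC(fit_raw, fit_sqrt)
print(aic_tab)
```

A small AIC difference would mean the data do not clearly favor either scale, in which case the untransformed variable is easier to interpret.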

ggplot(data, aes(x = over_time, y = idle_time_binary)) +
  geom_jitter(height = 0.1, width = 0, alpha = 0.5) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(title = "Idle Time vs. Overtime",
       x = "Overtime (over_time)", y = "Idle Time (0 or 1)")

## `geom_smooth()` using formula = 'y ~ x'

The relationship between overtime and idle time was plotted with a fitted logistic regression curve. The curve is smooth and relatively flat and shows no obvious sign that a transformation is needed. As with the previous plots, this is a visual check and not a formal test of linearity on the log-odds scale, so over_time is kept in its original form.

Interpret the coefficients, and explain what they mean in your notebook

Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and translate its meaning

model_transformed <- glm(idle_time_binary ~ sqrt_workers + over_time,
                         data = data, family = "binomial")
summary(model_transformed)

## 
## Call:
## glm(formula = idle_time_binary ~ sqrt_workers + over_time, family = "binomial", 
##     data = data)
## 
## Coefficients:
##                  Estimate   Std. Error z value  Pr(>|z|)    
## (Intercept)  -12.44433185   3.43223976  -3.626  0.000288 ***
## sqrt_workers   1.42994680   0.45977253   3.110  0.001870 ** 
## over_time     -0.00031471   0.00007852  -4.008 0.0000613 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 186.83  on 1196  degrees of freedom
## Residual deviance: 147.11  on 1194  degrees of freedom
## AIC: 153.11
## 
## Number of Fisher Scoring iterations: 10

confint(model_transformed, parm = "sqrt_workers")

## Waiting for profiling to be done...

##     2.5 %    97.5 % 
## 0.7740603 2.6144583

A logistic regression model was fitted with sqrt_workers and over_time as predictors of idle time. The coefficient for sqrt_workers is 1.43 (p = 0.00187), a statistically significant positive effect on the log-odds scale: holding over_time fixed, each one-unit increase in the square root of the number of workers raises the log-odds of idle time by about 1.43. The 95% confidence interval from confint(), which is based on the profile likelihood, is [0.774, 2.614]. Since the interval excludes 0, the data are consistent only with a positive effect. This supports the interpretation that larger teams are more likely to record idle time, possibly because of coordination challenges or inefficient task distribution.
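Log-odds coefficients are often easier to read as odds ratios. As a small worked example, the snippet below exponentiates the sqrt_workers estimate and the confint() interval reported above (the numbers are copied from the output; the variable names are my own). A one-unit increase in sqrt_workers corresponds, for instance, to going from 9 to 16 workers.

```r
# Values copied from the model output above (log-odds scale).
est_log_odds <- 1.42994680
ci_log_odds  <- c(lower = 0.7740603, upper = 2.6144583)

# Exponentiating converts log-odds to odds ratios.
odds_ratio    <- exp(est_log_odds)
odds_ratio_ci <- exp(ci_log_odds)

print(round(odds_ratio, 2))     # about 4.18
print(round(odds_ratio_ci, 2))  # about 2.17 and 13.66
```

With the fitted object available, exp(coef(model_transformed)) and exp(confint(model_transformed)) would give the same quantities for every term.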

coef_value <- coef(model_transformed)["sqrt_workers"]
std_error <- summary(model_transformed)$coefficients["sqrt_workers", "Std. Error"]

lower_bound <- coef_value - 1.96 * std_error
upper_bound <- coef_value + 1.96 * std_error

cat("95% Confidence Interval for sqrt_workers:\n")

## 95% Confidence Interval for sqrt_workers:

cat("Lower bound:", round(lower_bound, 3), "\n")

## Lower bound: 0.529

cat("Upper bound:", round(upper_bound, 3), "\n")

## Upper bound: 2.331

Built by hand from the standard error (estimate ± 1.96 × SE), the 95% Wald confidence interval for the sqrt_workers coefficient is [0.529, 2.331]. It differs slightly from the confint() interval of [0.774, 2.614] because confint() uses the profile likelihood, which need not be symmetric around the estimate, whereas the Wald interval is symmetric by construction. Both intervals lie entirely above 0, so the conclusion is the same: the effect is statistically significant and positive, and idle time becomes more likely as the square root of the number of workers increases. Larger teams may therefore experience more idle time, potentially because of inefficiencies in coordination or task distribution.
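The hand calculation above is a Wald interval, which base R also provides as confint.default(); confint() on a glm profiles the likelihood instead. The sketch below checks on simulated data (all names and numbers are invented for illustration) that the manual formula matches confint.default(), and prints the profile interval next to it for comparison.

```r
set.seed(42)

# Simulated data (hypothetical): one predictor, binary response.
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

# Manual Wald interval: estimate +/- z * standard error.
est <- coef(fit)["x"]
se  <- sqrt(diag(vcov(fit)))["x"]
manual <- unname(est + c(-1, 1) * qnorm(0.975) * se)

# Built-in Wald interval and profile-likelihood interval.
wald       <- unname(confint.default(fit, parm = "x")[1, ])
profile_ci <- suppressMessages(confint(fit, parm = "x"))

print(rbind(manual = manual, wald = wald, profile = unname(profile_ci)))
```

The two kinds of interval tend to agree closely in large samples with many events and to diverge when events are rare, which is one reason to report which method was used.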

coef_over <- coef(model_transformed)["over_time"]
se_over <- summary(model_transformed)$coefficients["over_time", "Std. Error"]

lower_over <- coef_over - 1.96 * se_over
upper_over <- coef_over + 1.96 * se_over

cat("95% Confidence Interval for over_time:\n")

## 95% Confidence Interval for over_time:

cat("Lower bound:", round(lower_over, 6), "\n")

## Lower bound: -0.000469

cat("Upper bound:", round(upper_over, 6), "\n")

## Upper bound: -0.000161

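I calculated the 95% Wald confidence interval for the over_time coefficient and found it to be [-0.000469, -0.000161]. In other words, I am 95% confident that the true change in the log-odds of idle time per one-unit increase in over_time, holding sqrt_workers fixed, lies in this range. Since the entire interval is below 0, the effect is statistically significant and negative: as overtime increases, the odds of idle time decrease. The coefficient looks tiny only because over_time is measured in small units (minutes, according to the dataset documentation), so a realistic change of hundreds or thousands of units corresponds to a meaningful shift in the odds. One possible explanation is that teams working more overtime use it to catch up on tasks and so have fewer gaps in production, but the model shows an association, not a cause.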
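Because over_time is recorded in small units, the per-unit coefficient is hard to interpret directly. A minimal sketch, using the estimate and standard error copied from the model output above, rescales the effect to a change of 1,000 units of over_time (the step of 1,000 is an arbitrary choice for illustration) and expresses it as an odds ratio.

```r
# Values copied from the model output above (log-odds per unit of over_time).
est  <- -0.00031471
se   <-  0.00007852
step <- 1000  # illustrative change in over_time, chosen arbitrarily

# Wald interval on the log-odds scale, then rescale and exponentiate.
wald_ci    <- est + c(-1, 1) * 1.96 * se
or_step    <- exp(step * est)
or_step_ci <- exp(step * wald_ci)

print(round(or_step, 2))     # about 0.73
print(round(or_step_ci, 2))  # about 0.63 and 0.85
```

Under this model, 1,000 additional units of overtime are associated with roughly 27% lower odds of idle time, holding sqrt_workers fixed.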

library(ggplot2)
library(broom)
model_tidy <- tidy(model_transformed, conf.int = TRUE)

ggplot(model_tidy, aes(x = term, y = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  labs(title = "Coefficient Estimates with 95% Confidence Intervals",
       x = "Variable", y = "Estimate (Log Odds)") +
  theme_minimal()

This plot shows the coefficient estimates from the logistic regression model with their 95% confidence intervals. Because tidy() also returns the intercept and all terms share one axis, the over_time interval is too narrow to see at this scale; the numerical intervals above confirm that neither the sqrt_workers interval nor the over_time interval includes 0, so both effects are statistically significant. The positive estimate for sqrt_workers indicates that larger teams are associated with a higher likelihood of idle time, and the negative estimate for over_time indicates that more overtime is associated with a lower likelihood. Visualizing the intervals complements the numerical results.
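When coefficients differ by orders of magnitude, one option is to drop the intercept and give each term its own axis. The following sketch fits a model to simulated data (the data frame, column names, and coefficients are invented for illustration) and uses facet_wrap() with free scales so that every interval is visible.

```r
library(ggplot2)
library(broom)

set.seed(7)

# Simulated data (hypothetical) with predictors on very different scales.
toy <- data.frame(small = rnorm(300, 5, 1), large = rnorm(300, 5000, 2000))
toy$y <- rbinom(300, 1, plogis(-4 + 0.9 * toy$small - 0.0003 * toy$large))

fit <- glm(y ~ small + large, data = toy, family = binomial)

# Keep only the slope terms; the intercept would dominate a shared axis.
slopes <- subset(suppressMessages(tidy(fit, conf.int = TRUE)),
                 term != "(Intercept)")

p <- ggplot(slopes, aes(x = term, y = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  facet_wrap(~ term, scales = "free") +
  labs(x = "Variable", y = "Estimate (Log Odds)")
print(p)
```

The dashed line at 0 makes it easy to see in each panel whether an interval excludes zero.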

Conclusion

This analysis applied logistic regression to explore factors associated with idle time in garment production. Both the square root of the number of workers (sqrt_workers) and overtime (over_time) were statistically significant predictors. Larger teams were associated with a higher likelihood of idle time, while more overtime was associated with a lower likelihood. The square-root transformation of no_of_workers was adopted to compress large team sizes; the two-predictor model has a lower AIC than the initial four-predictor model (153.11 versus 160.63), although the transformed and untransformed versions were not compared directly. The final results, supported by confidence intervals and plots, suggest that improving task coordination in larger teams and making effective use of overtime may help reduce idle time and improve productivity in garment manufacturing.