Introduction & Data

Image from CNN

The dataset is from UC Irvine Machine Learning Repository and has various information regarding the productivity of employees in a garment manufacturing companies. Originally, the data was published in the International Journal of Business Intelligence and Data Mining.

The variables I will use and their definitions:

no_of_workers - Number of workers of each team.
no_of_style_change - Number of changes in the style of a particular product.
targeted_productivity - targeted productivity set by the Authority for each team for each day.
over_time - Represents the amount of the overtime by each team in minutes.
incentive - Represents the amount of financial incentive in BDT(Bangladeshi Taka)
idle_time - The amount of time when the production was interrupted due to several reasons.
actual_productivity - The actual percentage of productivity that was delivered by the workers.

Research Question - Which factors best predict actual productivity among garment manufacturing workers?

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(RColorBrewer)
setwd("C:/Users/tonge/Downloads")
productivity <- read_csv("garments_worker_productivity.csv")

## Rows: 1197 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): date, quarter, department, day
## dbl (11): team, targeted_productivity, smv, wip, over_time, incentive, idle_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(productivity)

## # A tibble: 6 × 15
##   date     quarter  department day       team targeted_productivity   smv   wip
##   <chr>    <chr>    <chr>      <chr>    <dbl>                 <dbl> <dbl> <dbl>
## 1 1/1/2015 Quarter1 sweing     Thursday     8                  0.8  26.2   1108
## 2 1/1/2015 Quarter1 finishing  Thursday     1                  0.75  3.94    NA
## 3 1/1/2015 Quarter1 sweing     Thursday    11                  0.8  11.4    968
## 4 1/1/2015 Quarter1 sweing     Thursday    12                  0.8  11.4    968
## 5 1/1/2015 Quarter1 sweing     Thursday     6                  0.8  25.9   1170
## 6 1/1/2015 Quarter1 sweing     Thursday     7                  0.8  25.9    984
## # ℹ 7 more variables: over_time <dbl>, incentive <dbl>, idle_time <dbl>,
## #   idle_men <dbl>, no_of_style_change <dbl>, no_of_workers <dbl>,
## #   actual_productivity <dbl>

str(productivity)

## spc_tbl_ [1,197 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date                 : chr [1:1197] "1/1/2015" "1/1/2015" "1/1/2015" "1/1/2015" ...
##  $ quarter              : chr [1:1197] "Quarter1" "Quarter1" "Quarter1" "Quarter1" ...
##  $ department           : chr [1:1197] "sweing" "finishing" "sweing" "sweing" ...
##  $ day                  : chr [1:1197] "Thursday" "Thursday" "Thursday" "Thursday" ...
##  $ team                 : num [1:1197] 8 1 11 12 6 7 2 3 2 1 ...
##  $ targeted_productivity: num [1:1197] 0.8 0.75 0.8 0.8 0.8 0.8 0.75 0.75 0.75 0.75 ...
##  $ smv                  : num [1:1197] 26.16 3.94 11.41 11.41 25.9 ...
##  $ wip                  : num [1:1197] 1108 NA 968 968 1170 ...
##  $ over_time            : num [1:1197] 7080 960 3660 3660 1920 6720 960 6900 6000 6900 ...
##  $ incentive            : num [1:1197] 98 0 50 50 50 38 0 45 34 45 ...
##  $ idle_time            : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
##  $ idle_men             : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_style_change   : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_workers        : num [1:1197] 59 8 30.5 30.5 56 56 8 57.5 55 57.5 ...
##  $ actual_productivity  : num [1:1197] 0.941 0.886 0.801 0.801 0.8 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_character(),
##   ..   quarter = col_character(),
##   ..   department = col_character(),
##   ..   day = col_character(),
##   ..   team = col_double(),
##   ..   targeted_productivity = col_double(),
##   ..   smv = col_double(),
##   ..   wip = col_double(),
##   ..   over_time = col_double(),
##   ..   incentive = col_double(),
##   ..   idle_time = col_double(),
##   ..   idle_men = col_double(),
##   ..   no_of_style_change = col_double(),
##   ..   no_of_workers = col_double(),
##   ..   actual_productivity = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Cleaning

No NAs

colSums(is.na(productivity))

##                  date               quarter            department 
##                     0                     0                     0 
##                   day                  team targeted_productivity 
##                     0                     0                     0 
##                   smv                   wip             over_time 
##                     0                   506                     0 
##             incentive             idle_time              idle_men 
##                     0                     0                     0 
##    no_of_style_change         no_of_workers   actual_productivity 
##                     0                     0                     0

Exploratory Plots

Percentage of Productivity based on day of the week

Each day of the week fluctuates similarly. The most productive day seems to be Sunday, but they are all pretty close.

highchart() |>
  hc_add_series(data = productivity,
                type = "line",
                hcaes(y = actual_productivity,
                      group = day)) |>
  hc_yAxis(title = list(text = "Percent of Actual Work Productivity"))

Histogram of the actual productivity using R Shiny -

Slightly left skewed, the company mostly experiences a high productivity numbers. The most frequent productivity number is around 80%

library(shiny)

# Define UI for app that draws a histogram ----
ui <- fluidPage(
  
  # App title ----
  titlePanel("Hello Shiny!"),
  
  # Sidebar layout with input and output definitions ----
  sidebarLayout(
    
    # Sidebar panel for inputs ----
    sidebarPanel(
      
      # Input: Slider for the number of bins ----
      sliderInput(inputId = "bins",
                  label = "Number of bins:",
                  min = 5,
                  max = 50,
                  value = 30)
      
    ),
    
    # Main panel for displaying outputs ----
    mainPanel(
      
      # Output: Histogram ----
      plotOutput(outputId = "distPlot")
      
    )
  )
)

server <- function(input, output) {
  
  # Histogram of the Old Faithful Geyser Data ----
  # with requested number of bins
  # This expression that generates a histogram is wrapped in a call
  # to renderPlot to indicate that:
  #
  # 1. It is "reactive" and therefore should be automatically
  #    re-executed when inputs (input$bins) change
  # 2. Its output type is a plot
  output$distPlot <- renderPlot({
    
    x    <- productivity$actual_productivity
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    
    hist(x, breaks = bins, col = "dodgerblue", border = "cyan",
         xlab = "Percentages of Productivity",
         main = "Histogram of Percentages of Productivity")
    
  })
  
}


shinyApp(ui = ui, server = server)

Shiny applications not supported in static R Markdown documents

Scatterplot testing if incentive has any effect on the productivity-

There is no clear upward or downward curve meaning a high financial incentive does not seem to result in more productivity. Low incentives experience a range of productivity levels. There are a few outliers around 3000.

p2 <- ggplot(productivity, aes(x = incentive, y = actual_productivity)) +
labs(title = "Scatterplot of Incentive and Productivity Percentages",
caption = "Data from the UC Irvine Machine Learning Repository",
x = "Amount of financial incentive (in BDT)",
y = "Percentage of Productivity") +
theme_minimal(base_size = 12)
p2 + geom_point()

Multiple Linear Regression

Adjusted R Squared : About 0.007 meaning about 0.7% of productivity variance is explained by this model

Number of workers and minutes worked over time have the least significant p values. I will be removing these variables and testing out new variables to see if my adjusted R squared will improve.

fit1 <- lm(actual_productivity ~ incentive + no_of_workers  + over_time, data = productivity)
summary(fit1)

## 
## Call:
## lm(formula = actual_productivity ~ incentive + no_of_workers + 
##     over_time, data = productivity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56467 -0.08050  0.02895  0.11920  0.37506 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.493e-01  9.495e-03  78.922  < 2e-16 ***
## incentive      8.588e-05  3.148e-05   2.728  0.00646 ** 
## no_of_workers -3.811e-04  3.345e-04  -1.139  0.25482    
## over_time     -9.499e-07  2.215e-06  -0.429  0.66809    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1738 on 1193 degrees of freedom
## Multiple R-squared:  0.009834,   Adjusted R-squared:  0.007344 
## F-statistic: 3.949 on 3 and 1193 DF,  p-value: 0.00812

Backwards Elimination

All the variables(incentive, number of workers, numbers of style changes in a product, targeted productivity, and idle time) have resulted in a statistically significant p value. The smallest p value(< 2e-16) is seen in targeted_productivity.

The adjust R squared has improved and is now 19.7%.Therefore, 19.7% of productivity variance is explained by this model

fit2 <- lm(actual_productivity ~ incentive + no_of_style_change  + targeted_productivity + idle_time, data = productivity)
summary(fit2)

## 
## Call:
## lm(formula = actual_productivity ~ incentive + no_of_style_change + 
##     targeted_productivity + idle_time, data = productivity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55175 -0.05208  0.00734  0.09007  0.51753 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.334e-01  3.521e-02   6.630 5.08e-11 ***
## incentive              6.505e-05  2.825e-05   2.303   0.0215 *  
## no_of_style_change    -5.090e-02  1.081e-02  -4.708 2.79e-06 ***
## targeted_productivity  6.955e-01  4.733e-02  14.693  < 2e-16 ***
## idle_time             -8.191e-04  3.564e-04  -2.298   0.0217 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1564 on 1192 degrees of freedom
## Multiple R-squared:  0.1998, Adjusted R-squared:  0.1971 
## F-statistic: 74.39 on 4 and 1192 DF,  p-value: < 2.2e-16

Conclusion - According to this dataset the strongest predictor of work productivity is the targeted productivity, which is set by the Authority for each team for each day. The coefficient(6.955e-01) is positive showing the higher the target set the more productivity.

References(APA)

Productivity Prediction of Garment Employees [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C51S6D.

Wright, R., & Saeed, S. (2020, April 22). Bangladeshi garment workers face ruin as global brands ditch clothing contracts amid coronavirus pandemic. CNN. https://www.cnn.com/2020/04/22/business/bangladesh-garment-factories

Work Productivity Data 110

S Tonge

2026-04-19