Introduction & Data

Image of a garment factory in Bangladesh from CNN(Wright & Saeed)

The dataset is from UC Irvine Machine Learning Repository and has various information regarding the productivity of employees in a garment manufacturing companies. There is a total of 1167 observations, 4 categorical variables and 11 quantitative. Originally, the data was published in the International Journal of Business Intelligence and Data Mining. To clean this dataset, I checked for NA values. When I found there were none, I filtered the data to only include information from the sewing department which narrowed down the data to 691 observations. This data is important to me because I can learn what factors most positively effect worker output. This can help make management decisions like hiring more employees, raising the salary, or even reducing the number of employees to increase efficiency.

The variables I will use and their definitions:

Research Question - Which factors best predict actual productivity among sewing garment manufacturing workers?

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(RColorBrewer)
setwd("C:/Users/tonge/Downloads")
productivity <- read_csv("garments_worker_productivity.csv")
## Rows: 1197 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): date, quarter, department, day
## dbl (11): team, targeted_productivity, smv, wip, over_time, incentive, idle_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(productivity)
## # A tibble: 6 × 15
##   date     quarter  department day       team targeted_productivity   smv   wip
##   <chr>    <chr>    <chr>      <chr>    <dbl>                 <dbl> <dbl> <dbl>
## 1 1/1/2015 Quarter1 sweing     Thursday     8                  0.8  26.2   1108
## 2 1/1/2015 Quarter1 finishing  Thursday     1                  0.75  3.94    NA
## 3 1/1/2015 Quarter1 sweing     Thursday    11                  0.8  11.4    968
## 4 1/1/2015 Quarter1 sweing     Thursday    12                  0.8  11.4    968
## 5 1/1/2015 Quarter1 sweing     Thursday     6                  0.8  25.9   1170
## 6 1/1/2015 Quarter1 sweing     Thursday     7                  0.8  25.9    984
## # ℹ 7 more variables: over_time <dbl>, incentive <dbl>, idle_time <dbl>,
## #   idle_men <dbl>, no_of_style_change <dbl>, no_of_workers <dbl>,
## #   actual_productivity <dbl>
str(productivity)
## spc_tbl_ [1,197 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date                 : chr [1:1197] "1/1/2015" "1/1/2015" "1/1/2015" "1/1/2015" ...
##  $ quarter              : chr [1:1197] "Quarter1" "Quarter1" "Quarter1" "Quarter1" ...
##  $ department           : chr [1:1197] "sweing" "finishing" "sweing" "sweing" ...
##  $ day                  : chr [1:1197] "Thursday" "Thursday" "Thursday" "Thursday" ...
##  $ team                 : num [1:1197] 8 1 11 12 6 7 2 3 2 1 ...
##  $ targeted_productivity: num [1:1197] 0.8 0.75 0.8 0.8 0.8 0.8 0.75 0.75 0.75 0.75 ...
##  $ smv                  : num [1:1197] 26.16 3.94 11.41 11.41 25.9 ...
##  $ wip                  : num [1:1197] 1108 NA 968 968 1170 ...
##  $ over_time            : num [1:1197] 7080 960 3660 3660 1920 6720 960 6900 6000 6900 ...
##  $ incentive            : num [1:1197] 98 0 50 50 50 38 0 45 34 45 ...
##  $ idle_time            : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
##  $ idle_men             : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_style_change   : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_workers        : num [1:1197] 59 8 30.5 30.5 56 56 8 57.5 55 57.5 ...
##  $ actual_productivity  : num [1:1197] 0.941 0.886 0.801 0.801 0.8 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_character(),
##   ..   quarter = col_character(),
##   ..   department = col_character(),
##   ..   day = col_character(),
##   ..   team = col_double(),
##   ..   targeted_productivity = col_double(),
##   ..   smv = col_double(),
##   ..   wip = col_double(),
##   ..   over_time = col_double(),
##   ..   incentive = col_double(),
##   ..   idle_time = col_double(),
##   ..   idle_men = col_double(),
##   ..   no_of_style_change = col_double(),
##   ..   no_of_workers = col_double(),
##   ..   actual_productivity = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Cleaning

Cheching for NAs - No NAs in the variables I will be using

colSums(is.na(productivity))
##                  date               quarter            department 
##                     0                     0                     0 
##                   day                  team targeted_productivity 
##                     0                     0                     0 
##                   smv                   wip             over_time 
##                     0                   506                     0 
##             incentive             idle_time              idle_men 
##                     0                     0                     0 
##    no_of_style_change         no_of_workers   actual_productivity 
##                     0                     0                     0

Filtering to only include data on employees within the sewing department

productivity1 <- productivity %>%
  filter(department == "sweing")

Exploratory Plots

Percentage of Productivity based on day of the week

Each day of the week fluctuates similarly. The most productive day seems to be Thursday, but they are all pretty close.

highchart() |>
  hc_add_series(data = productivity1,
                type = "line",
                hcaes(y = actual_productivity,
                      group = day)) |>
  hc_yAxis(title = list(text = "Percent of Actual Work Productivity")) |>
           hc_caption(text = "Data from the UC Irvine Machine Learning Repository")

Histogram of the actual productivity using R Shiny -

Slightly left skewed, the company mostly experiences a high productivity numbers. The most frequent productivity number is around 80%.

library(shiny)

# Define UI for app that draws a histogram ----
ui <- fluidPage(
  
  # App title ----
  titlePanel("Hello Shiny!"),
  
  # Sidebar layout with input and output definitions ----
  sidebarLayout(
    
    # Sidebar panel for inputs ----
    sidebarPanel(
      
      # Input: Slider for the number of bins ----
      sliderInput(inputId = "bins",
                  label = "Number of bins:",
                  min = 5,
                  max = 50,
                  value = 30)
      
    ),
    
    # Main panel for displaying outputs ----
    mainPanel(
      
      # Output: Histogram ----
      plotOutput(outputId = "distPlot")
      
    )
  )
)
server <- function(input, output) {
  
  # Histogram of the Old Faithful Geyser Data ----
  # with requested number of bins
  # This expression that generates a histogram is wrapped in a call
  # to renderPlot to indicate that:
  #
  # 1. It is "reactive" and therefore should be automatically
  #    re-executed when inputs (input$bins) change
  # 2. Its output type is a plot
  output$distPlot <- renderPlot({
    
    x    <- productivity1$actual_productivity
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    
    hist(x, breaks = bins, col = "dodgerblue", border = "cyan",
         xlab = "Percentages of Productivity",
         main = "Histogram of Percentages of Productivity")
    
  })
  
}


shinyApp(ui = ui, server = server)
Shiny applications not supported in static R Markdown documents

Scatterplot testing if financial incentive has any effect on the productivity-

Most of the points populate the low incentive amounts, showing that very few employees receive high incentives. However, there is an upward trend, suggesting that productivity tend to increase as financial incentives increase. Additionally, the hight productivity percent correlate with the highest incentive amount.

p2 <- ggplot(productivity1, aes(x = incentive, y = actual_productivity)) +
labs(title = "Scatterplot of Incentive and Productivity Percentages",
caption = "Data from the UC Irvine Machine Learning Repository",
x = "Amount of financial incentive (in BDT)",
y = "Percentage of Productivity") +
theme_dark(base_size = 12)
p2 + geom_point() +
  geom_smooth(color = "green")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Multiple Linear Regression

For this regression I chose the incentive variable because the previous scatterplot showed a clear relationship between incentive and actual_productivity. Next, I added number of workers because typically the more hands on a task, the faster task get done. Lastly, I selected the over time variable to see if it resulted in productivity or not due to exhaustion.

fit1 <- lm(actual_productivity ~ incentive + no_of_workers  + over_time, data = productivity1)
summary(fit1)
## 
## Call:
## lm(formula = actual_productivity ~ incentive + no_of_workers + 
##     over_time, data = productivity1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33989 -0.03131  0.01246  0.04441  0.30027 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.764e-01  1.998e-02  28.853  < 2e-16 ***
## incentive      4.579e-03  1.260e-04  36.342  < 2e-16 ***
## no_of_workers -4.824e-04  3.913e-04  -1.233 0.218028    
## over_time     -5.040e-06  1.289e-06  -3.909 0.000102 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09072 on 687 degrees of freedom
## Multiple R-squared:  0.658,  Adjusted R-squared:  0.6565 
## F-statistic: 440.5 on 3 and 687 DF,  p-value: < 2.2e-16

Interpretation: Adjusted R Squared : About .6565 meaning about 65.7% of productivity variance is explained by this model

Number of workers and minutes worked over time have the least significant p values. I will be removing these variables and testing out new variables to see if my adjusted R squared will improve.

Backwards Elimination

The incentive variable stayed in the model since it had the smallest p value in the previous regression model. I added targeted productivity it see if high expectations increased productivity. Then I added idle time to see if breaks rejuvenated employees and increased their productivity.

fit2 <- lm(actual_productivity ~ incentive  + targeted_productivity + idle_time, data = productivity1)
summary(fit2)
## 
## Call:
## lm(formula = actual_productivity ~ incentive + targeted_productivity + 
##     idle_time, data = productivity1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39789 -0.02539  0.01049  0.02883  0.35041 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.1297821  0.0211647   6.132 1.46e-09 ***
## incentive              0.0034022  0.0001178  28.878  < 2e-16 ***
## targeted_productivity  0.6093140  0.0317531  19.189  < 2e-16 ***
## idle_time             -0.0001911  0.0001704  -1.122    0.262    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07433 on 687 degrees of freedom
## Multiple R-squared:  0.7704, Adjusted R-squared:  0.7694 
## F-statistic: 768.4 on 3 and 687 DF,  p-value: < 2.2e-16

Interpretation

The variables incentive and targeted productivity both resulted in a statistically significant p value(< 2e-16).

The Adjusted R Squared has improved and is now .7694. Therefore, about 77% of productivity variance is explained by this model.

Final equation: actual_productivity= 0.1297821 + 0.0034022(incentive) + 0.609314(targeted_productivity) - 0.0001911(idle_time)

Conclusion - According to this dataset the strongest predictor of work productivity in the sewing department is the targeted productivity and financial incentives. The coefficients of both are positive showing the higher the target and financial incentive, the more productivity in the sewing department. This makes sense as both can act as motivators for the employees and can encourage them to work efficiently.

References(APA)

Productivity Prediction of Garment Employees [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C51S6D.

Wright, R., & Saeed, S. (2020, April 22). Bangladeshi garment workers face ruin as global brands ditch clothing contracts amid coronavirus pandemic. CNN. https://www.cnn.com/2020/04/22/business/bangladesh-garment-factories