Image of a garment factory in Bangladesh from CNN(Wright & Saeed)
The dataset is from UC Irvine Machine Learning Repository and has various information regarding the productivity of employees in a garment manufacturing companies. There is a total of 1167 observations, 4 categorical variables and 11 quantitative. Originally, the data was published in the International Journal of Business Intelligence and Data Mining. To clean this dataset, I checked for NA values. When I found there were none, I filtered the data to only include information from the sewing department which narrowed down the data to 691 observations. This data is important to me because I can learn what factors most positively effect worker output. This can help make management decisions like hiring more employees, raising the salary, or even reducing the number of employees to increase efficiency.
The variables I will use and their definitions:
Research Question - Which factors best predict actual productivity among sewing garment manufacturing workers?
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(RColorBrewer)
setwd("C:/Users/tonge/Downloads")
productivity <- read_csv("garments_worker_productivity.csv")
## Rows: 1197 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): date, quarter, department, day
## dbl (11): team, targeted_productivity, smv, wip, over_time, incentive, idle_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(productivity)
## # A tibble: 6 × 15
## date quarter department day team targeted_productivity smv wip
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/2015 Quarter1 sweing Thursday 8 0.8 26.2 1108
## 2 1/1/2015 Quarter1 finishing Thursday 1 0.75 3.94 NA
## 3 1/1/2015 Quarter1 sweing Thursday 11 0.8 11.4 968
## 4 1/1/2015 Quarter1 sweing Thursday 12 0.8 11.4 968
## 5 1/1/2015 Quarter1 sweing Thursday 6 0.8 25.9 1170
## 6 1/1/2015 Quarter1 sweing Thursday 7 0.8 25.9 984
## # ℹ 7 more variables: over_time <dbl>, incentive <dbl>, idle_time <dbl>,
## # idle_men <dbl>, no_of_style_change <dbl>, no_of_workers <dbl>,
## # actual_productivity <dbl>
str(productivity)
## spc_tbl_ [1,197 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : chr [1:1197] "1/1/2015" "1/1/2015" "1/1/2015" "1/1/2015" ...
## $ quarter : chr [1:1197] "Quarter1" "Quarter1" "Quarter1" "Quarter1" ...
## $ department : chr [1:1197] "sweing" "finishing" "sweing" "sweing" ...
## $ day : chr [1:1197] "Thursday" "Thursday" "Thursday" "Thursday" ...
## $ team : num [1:1197] 8 1 11 12 6 7 2 3 2 1 ...
## $ targeted_productivity: num [1:1197] 0.8 0.75 0.8 0.8 0.8 0.8 0.75 0.75 0.75 0.75 ...
## $ smv : num [1:1197] 26.16 3.94 11.41 11.41 25.9 ...
## $ wip : num [1:1197] 1108 NA 968 968 1170 ...
## $ over_time : num [1:1197] 7080 960 3660 3660 1920 6720 960 6900 6000 6900 ...
## $ incentive : num [1:1197] 98 0 50 50 50 38 0 45 34 45 ...
## $ idle_time : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
## $ idle_men : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_style_change : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_workers : num [1:1197] 59 8 30.5 30.5 56 56 8 57.5 55 57.5 ...
## $ actual_productivity : num [1:1197] 0.941 0.886 0.801 0.801 0.8 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_character(),
## .. quarter = col_character(),
## .. department = col_character(),
## .. day = col_character(),
## .. team = col_double(),
## .. targeted_productivity = col_double(),
## .. smv = col_double(),
## .. wip = col_double(),
## .. over_time = col_double(),
## .. incentive = col_double(),
## .. idle_time = col_double(),
## .. idle_men = col_double(),
## .. no_of_style_change = col_double(),
## .. no_of_workers = col_double(),
## .. actual_productivity = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Cheching for NAs - No NAs in the variables I will be using
colSums(is.na(productivity))
## date quarter department
## 0 0 0
## day team targeted_productivity
## 0 0 0
## smv wip over_time
## 0 506 0
## incentive idle_time idle_men
## 0 0 0
## no_of_style_change no_of_workers actual_productivity
## 0 0 0
Filtering to only include data on employees within the sewing department
productivity1 <- productivity %>%
filter(department == "sweing")
Percentage of Productivity based on day of the week
Each day of the week fluctuates similarly. The most productive day seems to be Thursday, but they are all pretty close.
highchart() |>
hc_add_series(data = productivity1,
type = "line",
hcaes(y = actual_productivity,
group = day)) |>
hc_yAxis(title = list(text = "Percent of Actual Work Productivity")) |>
hc_caption(text = "Data from the UC Irvine Machine Learning Repository")
Histogram of the actual productivity using R Shiny -
Slightly left skewed, the company mostly experiences a high productivity numbers. The most frequent productivity number is around 80%.
library(shiny)
# Define UI for app that draws a histogram ----
ui <- fluidPage(
# App title ----
titlePanel("Hello Shiny!"),
# Sidebar layout with input and output definitions ----
sidebarLayout(
# Sidebar panel for inputs ----
sidebarPanel(
# Input: Slider for the number of bins ----
sliderInput(inputId = "bins",
label = "Number of bins:",
min = 5,
max = 50,
value = 30)
),
# Main panel for displaying outputs ----
mainPanel(
# Output: Histogram ----
plotOutput(outputId = "distPlot")
)
)
)
server <- function(input, output) {
# Histogram of the Old Faithful Geyser Data ----
# with requested number of bins
# This expression that generates a histogram is wrapped in a call
# to renderPlot to indicate that:
#
# 1. It is "reactive" and therefore should be automatically
# re-executed when inputs (input$bins) change
# 2. Its output type is a plot
output$distPlot <- renderPlot({
x <- productivity1$actual_productivity
bins <- seq(min(x), max(x), length.out = input$bins + 1)
hist(x, breaks = bins, col = "dodgerblue", border = "cyan",
xlab = "Percentages of Productivity",
main = "Histogram of Percentages of Productivity")
})
}
shinyApp(ui = ui, server = server)
Scatterplot testing if financial incentive has any effect on the productivity-
Most of the points populate the low incentive amounts, showing that very few employees receive high incentives. However, there is an upward trend, suggesting that productivity tend to increase as financial incentives increase. Additionally, the hight productivity percent correlate with the highest incentive amount.
p2 <- ggplot(productivity1, aes(x = incentive, y = actual_productivity)) +
labs(title = "Scatterplot of Incentive and Productivity Percentages",
caption = "Data from the UC Irvine Machine Learning Repository",
x = "Amount of financial incentive (in BDT)",
y = "Percentage of Productivity") +
theme_dark(base_size = 12)
p2 + geom_point() +
geom_smooth(color = "green")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
For this regression I chose the incentive variable because the previous scatterplot showed a clear relationship between incentive and actual_productivity. Next, I added number of workers because typically the more hands on a task, the faster task get done. Lastly, I selected the over time variable to see if it resulted in productivity or not due to exhaustion.
fit1 <- lm(actual_productivity ~ incentive + no_of_workers + over_time, data = productivity1)
summary(fit1)
##
## Call:
## lm(formula = actual_productivity ~ incentive + no_of_workers +
## over_time, data = productivity1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33989 -0.03131 0.01246 0.04441 0.30027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.764e-01 1.998e-02 28.853 < 2e-16 ***
## incentive 4.579e-03 1.260e-04 36.342 < 2e-16 ***
## no_of_workers -4.824e-04 3.913e-04 -1.233 0.218028
## over_time -5.040e-06 1.289e-06 -3.909 0.000102 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09072 on 687 degrees of freedom
## Multiple R-squared: 0.658, Adjusted R-squared: 0.6565
## F-statistic: 440.5 on 3 and 687 DF, p-value: < 2.2e-16
Interpretation: Adjusted R Squared : About .6565 meaning about 65.7% of productivity variance is explained by this model
Number of workers and minutes worked over time have the least significant p values. I will be removing these variables and testing out new variables to see if my adjusted R squared will improve.
The incentive variable stayed in the model since it had the smallest p value in the previous regression model. I added targeted productivity it see if high expectations increased productivity. Then I added idle time to see if breaks rejuvenated employees and increased their productivity.
fit2 <- lm(actual_productivity ~ incentive + targeted_productivity + idle_time, data = productivity1)
summary(fit2)
##
## Call:
## lm(formula = actual_productivity ~ incentive + targeted_productivity +
## idle_time, data = productivity1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39789 -0.02539 0.01049 0.02883 0.35041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1297821 0.0211647 6.132 1.46e-09 ***
## incentive 0.0034022 0.0001178 28.878 < 2e-16 ***
## targeted_productivity 0.6093140 0.0317531 19.189 < 2e-16 ***
## idle_time -0.0001911 0.0001704 -1.122 0.262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07433 on 687 degrees of freedom
## Multiple R-squared: 0.7704, Adjusted R-squared: 0.7694
## F-statistic: 768.4 on 3 and 687 DF, p-value: < 2.2e-16
Interpretation
The variables incentive and targeted productivity both resulted in a statistically significant p value(< 2e-16).
The Adjusted R Squared has improved and is now .7694. Therefore, about 77% of productivity variance is explained by this model.
Final equation: actual_productivity= 0.1297821 + 0.0034022(incentive) + 0.609314(targeted_productivity) - 0.0001911(idle_time)
Conclusion - According to this dataset the strongest predictor of work productivity in the sewing department is the targeted productivity and financial incentives. The coefficients of both are positive showing the higher the target and financial incentive, the more productivity in the sewing department. This makes sense as both can act as motivators for the employees and can encourage them to work efficiently.
Productivity Prediction of Garment Employees [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C51S6D.
Wright, R., & Saeed, S. (2020, April 22). Bangladeshi garment workers face ruin as global brands ditch clothing contracts amid coronavirus pandemic. CNN. https://www.cnn.com/2020/04/22/business/bangladesh-garment-factories